INN Hotels Project¶

Description¶

Context¶

A significant number of hotel bookings are called off due to cancellations or no-shows. The typical reasons for cancellations include change of plans, scheduling conflicts, etc. This is often made easier by the option to do so free of charge or preferably at a low cost which is beneficial to hotel guests but it is a less desirable and possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high on last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impact a hotel on various fronts:

  1. Loss of resources (revenue) when the hotel cannot resell the room.
  2. Additional costs of distribution channels by increasing commissions or paying for publicity to help sell these rooms.
  3. Lowering prices last minute, so the hotel can resell a room, resulting in reducing the profit margin.
  4. Human resources to make arrangements for the guests.

Objective¶

The increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal, they are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. You as a data scientist have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.

# Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Booking_ID: the unique identifier of each booking

no_of_adults: Number of adults

no_of_children: Number of Children

no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

no_of_week_nights: Number of weeknights (Monday to Friday) the guest stayed or booked to stay at the hotel

type_of_meal_plan: Type of meal plan booked by the customer:

Not Selected – No meal plan selected

Meal Plan 1 – Breakfast

Meal Plan 2 – Half board (breakfast and one other meal)

Meal Plan 3 – Full board (breakfast, lunch, and dinner)

required_car_parking_space: Does the customer require a car parking space? (0 - No, 1- Yes)

room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels Group

lead_time: Number of days between the date of booking and the arrival date arrival_year: Year of arrival date

arrival_month: Month of arrival date

arrival_date: Date of the month

market_segment_type: Market segment designation.

repeated_guest: Is the customer a repeated guest? (0 - No, 1- Yes)

no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking

no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking

avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)

no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)

booking_status: Flag indicating if the booking was canceled or not.

In [ ]:
from google.colab import files


uploaded = files.upload()
Upload widget is only available when the cell has been executed in the current browser session. Please rerun this cell to enable.
Saving INNHotelsGroup.csv to INNHotelsGroup (1).csv
In [ ]:
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
In [ ]:
import io

df = pd.read_csv(io.BytesIO(uploaded['INNHotelsGroup.csv']))
print(df)
      Booking_ID  no_of_adults  no_of_children  no_of_weekend_nights  \
0       INN00001             2               0                     1   
1       INN00002             2               0                     2   
2       INN00003             1               0                     2   
3       INN00004             2               0                     0   
4       INN00005             2               0                     1   
...          ...           ...             ...                   ...   
36270   INN36271             3               0                     2   
36271   INN36272             2               0                     1   
36272   INN36273             2               0                     2   
36273   INN36274             2               0                     0   
36274   INN36275             2               0                     1   

       no_of_week_nights type_of_meal_plan  required_car_parking_space  \
0                      2       Meal Plan 1                           0   
1                      3      Not Selected                           0   
2                      1       Meal Plan 1                           0   
3                      2       Meal Plan 1                           0   
4                      1      Not Selected                           0   
...                  ...               ...                         ...   
36270                  6       Meal Plan 1                           0   
36271                  3       Meal Plan 1                           0   
36272                  6       Meal Plan 1                           0   
36273                  3      Not Selected                           0   
36274                  2       Meal Plan 1                           0   

      room_type_reserved  lead_time  arrival_year  arrival_month  \
0            Room_Type 1        224          2017             10   
1            Room_Type 1          5          2018             11   
2            Room_Type 1          1          2018              2   
3            Room_Type 1        211          2018              5   
4            Room_Type 1         48          2018              4   
...                  ...        ...           ...            ...   
36270        Room_Type 4         85          2018              8   
36271        Room_Type 1        228          2018             10   
36272        Room_Type 1        148          2018              7   
36273        Room_Type 1         63          2018              4   
36274        Room_Type 1        207          2018             12   

       arrival_date market_segment_type  repeated_guest  \
0                 2             Offline               0   
1                 6              Online               0   
2                28              Online               0   
3                20              Online               0   
4                11              Online               0   
...             ...                 ...             ...   
36270             3              Online               0   
36271            17              Online               0   
36272             1              Online               0   
36273            21              Online               0   
36274            30             Offline               0   

       no_of_previous_cancellations  no_of_previous_bookings_not_canceled  \
0                                 0                                     0   
1                                 0                                     0   
2                                 0                                     0   
3                                 0                                     0   
4                                 0                                     0   
...                             ...                                   ...   
36270                             0                                     0   
36271                             0                                     0   
36272                             0                                     0   
36273                             0                                     0   
36274                             0                                     0   

       avg_price_per_room  no_of_special_requests booking_status  
0                   65.00                       0   Not_Canceled  
1                  106.68                       1   Not_Canceled  
2                   60.00                       0       Canceled  
3                  100.00                       0       Canceled  
4                   94.50                       0       Canceled  
...                   ...                     ...            ...  
36270              167.80                       1   Not_Canceled  
36271               90.95                       2       Canceled  
36272               98.39                       2   Not_Canceled  
36273               94.50                       0       Canceled  
36274              161.67                       0   Not_Canceled  

[36275 rows x 19 columns]
In [ ]:
df.head()
Out[ ]:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
0 INN00001 2 0 1 2 Meal Plan 1 0 Room_Type 1 224 2017 10 2 Offline 0 0 0 65.00 0 Not_Canceled
1 INN00002 2 0 2 3 Not Selected 0 Room_Type 1 5 2018 11 6 Online 0 0 0 106.68 1 Not_Canceled
2 INN00003 1 0 2 1 Meal Plan 1 0 Room_Type 1 1 2018 2 28 Online 0 0 0 60.00 0 Canceled
3 INN00004 2 0 0 2 Meal Plan 1 0 Room_Type 1 211 2018 5 20 Online 0 0 0 100.00 0 Canceled
4 INN00005 2 0 1 1 Not Selected 0 Room_Type 1 48 2018 4 11 Online 0 0 0 94.50 0 Canceled

Observation: * The DataFrame has 36275 rows and 19 columns as mentioned in the Data Dictionary.

In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36275 non-null  object 
 1   no_of_adults                          36275 non-null  int64  
 2   no_of_children                        36275 non-null  int64  
 3   no_of_weekend_nights                  36275 non-null  int64  
 4   no_of_week_nights                     36275 non-null  int64  
 5   type_of_meal_plan                     36275 non-null  object 
 6   required_car_parking_space            36275 non-null  int64  
 7   room_type_reserved                    36275 non-null  object 
 8   lead_time                             36275 non-null  int64  
 9   arrival_year                          36275 non-null  int64  
 10  arrival_month                         36275 non-null  int64  
 11  arrival_date                          36275 non-null  int64  
 12  market_segment_type                   36275 non-null  object 
 13  repeated_guest                        36275 non-null  int64  
 14  no_of_previous_cancellations          36275 non-null  int64  
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64  
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64  
 18  booking_status                        36275 non-null  object 
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB

Observation

There are a total of 36275 non-null observations in each of the columns.

There are 19 columns out of which 5 are object variables others are numeric in that 13 are integers and 1 is float.

A memory usage of 404.9+ KB is used.

In [ ]:
df.shape
Out[ ]:
(36275, 19)

Checking for Missing Values¶

In [ ]:
missing_values=pd.isnull(df)
print(missing_values)
       Booking_ID  no_of_adults  no_of_children  no_of_weekend_nights  \
0           False         False           False                 False   
1           False         False           False                 False   
2           False         False           False                 False   
3           False         False           False                 False   
4           False         False           False                 False   
...           ...           ...             ...                   ...   
36270       False         False           False                 False   
36271       False         False           False                 False   
36272       False         False           False                 False   
36273       False         False           False                 False   
36274       False         False           False                 False   

       no_of_week_nights  type_of_meal_plan  required_car_parking_space  \
0                  False              False                       False   
1                  False              False                       False   
2                  False              False                       False   
3                  False              False                       False   
4                  False              False                       False   
...                  ...                ...                         ...   
36270              False              False                       False   
36271              False              False                       False   
36272              False              False                       False   
36273              False              False                       False   
36274              False              False                       False   

       room_type_reserved  lead_time  arrival_year  arrival_month  \
0                   False      False         False          False   
1                   False      False         False          False   
2                   False      False         False          False   
3                   False      False         False          False   
4                   False      False         False          False   
...                   ...        ...           ...            ...   
36270               False      False         False          False   
36271               False      False         False          False   
36272               False      False         False          False   
36273               False      False         False          False   
36274               False      False         False          False   

       arrival_date  market_segment_type  repeated_guest  \
0             False                False           False   
1             False                False           False   
2             False                False           False   
3             False                False           False   
4             False                False           False   
...             ...                  ...             ...   
36270         False                False           False   
36271         False                False           False   
36272         False                False           False   
36273         False                False           False   
36274         False                False           False   

       no_of_previous_cancellations  no_of_previous_bookings_not_canceled  \
0                             False                                 False   
1                             False                                 False   
2                             False                                 False   
3                             False                                 False   
4                             False                                 False   
...                             ...                                   ...   
36270                         False                                 False   
36271                         False                                 False   
36272                         False                                 False   
36273                         False                                 False   
36274                         False                                 False   

       avg_price_per_room  no_of_special_requests  booking_status  
0                   False                   False           False  
1                   False                   False           False  
2                   False                   False           False  
3                   False                   False           False  
4                   False                   False           False  
...                   ...                     ...             ...  
36270               False                   False           False  
36271               False                   False           False  
36272               False                   False           False  
36273               False                   False           False  
36274               False                   False           False  

[36275 rows x 19 columns]
In [ ]:
df.isnull().sum()
Out[ ]:
Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64

Observations¶

It shows that there are no missing values across each columns.

Statistical Summary of Numeric variables¶

In [ ]:
summary = df.describe()
print(summary)
       no_of_adults  no_of_children  no_of_weekend_nights  no_of_week_nights  \
count  36275.000000    36275.000000          36275.000000       36275.000000   
mean       1.844962        0.105279              0.810724           2.204300   
std        0.518715        0.402648              0.870644           1.410905   
min        0.000000        0.000000              0.000000           0.000000   
25%        2.000000        0.000000              0.000000           1.000000   
50%        2.000000        0.000000              1.000000           2.000000   
75%        2.000000        0.000000              2.000000           3.000000   
max        4.000000       10.000000              7.000000          17.000000   

       required_car_parking_space     lead_time  arrival_year  arrival_month  \
count                36275.000000  36275.000000  36275.000000   36275.000000   
mean                     0.030986     85.232557   2017.820427       7.423653   
std                      0.173281     85.930817      0.383836       3.069894   
min                      0.000000      0.000000   2017.000000       1.000000   
25%                      0.000000     17.000000   2018.000000       5.000000   
50%                      0.000000     57.000000   2018.000000       8.000000   
75%                      0.000000    126.000000   2018.000000      10.000000   
max                      1.000000    443.000000   2018.000000      12.000000   

       arrival_date  repeated_guest  no_of_previous_cancellations  \
count  36275.000000    36275.000000                  36275.000000   
mean      15.596995        0.025637                      0.023349   
std        8.740447        0.158053                      0.368331   
min        1.000000        0.000000                      0.000000   
25%        8.000000        0.000000                      0.000000   
50%       16.000000        0.000000                      0.000000   
75%       23.000000        0.000000                      0.000000   
max       31.000000        1.000000                     13.000000   

       no_of_previous_bookings_not_canceled  avg_price_per_room  \
count                          36275.000000        36275.000000   
mean                               0.153411          103.423539   
std                                1.754171           35.089424   
min                                0.000000            0.000000   
25%                                0.000000           80.300000   
50%                                0.000000           99.450000   
75%                                0.000000          120.000000   
max                               58.000000          540.000000   

       no_of_special_requests  
count            36275.000000  
mean                 0.619655  
std                  0.786236  
min                  0.000000  
25%                  0.000000  
50%                  0.000000  
75%                  1.000000  
max                  5.000000  

Observation¶

Number of Adults:

The average number of adults per booking is approximately 1.84.

The majority of bookings have 2 adults, as indicated by the median (50th percentile) and the 25th and 75th percentiles all being 2.

The range is between 0 and 4 adults per booking, with a standard deviation of 0.52, indicating relatively little variation.

Number of Children:

The average number of children per booking is 0.11.

The median and the 25th and 75th percentiles all being 0 suggest that most bookings do not include children.

The number of children per booking ranges from 0 to 10, but bookings with children are relatively uncommon, as indicated by the low mean and the small standard deviation of 0.40.

Number of Weekend Nights:

On average, guests book for about 0.81 weekend nights.

The median is 1, with the 25th percentile at 0 and the 75th percentile at 2, indicating a common booking pattern of 1 or 2 weekend nights.

The range is from 0 to 7 weekend nights, with a standard deviation of 0.87, indicating some variability.

Number of Week Nights:

The average number of weeknights booked is 2.20.

The median is 2, with the 25th percentile at 1 and the 75th percentile at 3, suggesting most bookings are for 1 to 3 weeknights.

The range is from 0 to 17 weeknights, with a standard deviation of 1.41, indicating moderate variability.

Required Car Parking Space:

Only about 3% of bookings require a car parking space, as indicated by the mean of 0.03.

The majority of bookings do not require a parking space, as shown by the 0 value for the 25th, 50th, and 75th percentiles.

Lead Time:

The average lead time for bookings is approximately 85 days.

The median lead time is 57 days, with the 25th percentile at 17 days and the 75th percentile at 126 days, indicating a wide range of lead times.

The range is from 0 to 443 days, with a standard deviation of 85.93, showing significant variability.

Arrival Year:

The data spans bookings primarily in the year 2018, as indicated by the mean year being close to 2018 and the minimum and maximum values both being 2018.

Arrival Month:

Bookings are spread throughout the year, with a slight peak around the middle of the year, as the mean month is approximately 7.42.

The median is 8, with the 25th percentile at 5 and the 75th percentile at 10, indicating a fairly even distribution of bookings across different months.

Arrival Date:

The average arrival date is around the 15th of the month.

The range of arrival dates is from 1 to 31, indicating bookings occur throughout the entire month.

Repeated Guest:

Only about 2.56% of guests are repeat visitors, as indicated by the mean. The majority of guests are first-time visitors, as shown by the 0 value for the 25th, 50th, and 75th percentiles.

Number of Previous Cancellations:

The average number of previous cancellations per guest is very low (0.02). The majority of guests have no prior cancellations, as indicated by the 0 value for the 25th, 50th, and 75th percentiles. However, there are some guests with a significant number of previous cancellations, as the maximum value is 13.

Number of Previous Bookings Not Canceled:

The average number of previous bookings not canceled is 0.15. Most guests have no previous bookings that were not canceled, as shown by the 0 value for the 25th, 50th, and 75th percentiles. The maximum value is 58, indicating that some guests have a high number of previous successful bookings.

Average Price Per Room:

The average price per room is approximately $103.42. The median price is $99.45, with the 25th percentile at $80.30 and the 75th percentile at $120.00, indicating a reasonable range for room prices. The price ranges from $0 to $540, with a standard deviation of 35.09, indicating significant variability in room pricing.

Number of Special Requests:

On average, guests make about 0.62 special requests per booking. The median and the 25th percentile values are 0, while the 75th percentile is 1, suggesting that most guests do not make special requests, but a significant portion do make at least one request. The number of special requests ranges from 0 to 5, with a standard deviation of 0.79.

Statistical Summary of Categorical variables¶

In [ ]:
df.describe(include = ['object']).T
Out[ ]:
count unique top freq
Booking_ID 36275 36275 INN00001 1
type_of_meal_plan 36275 4 Meal Plan 1 27835
room_type_reserved 36275 7 Room_Type 1 28130
market_segment_type 36275 5 Online 23214
booking_status 36275 2 Not_Canceled 24390

Observation¶

Booking_ID:

There are 36,275 unique booking IDs, indicating the total number of bookings in the dataset.

Type of Meal Plan:

There are 4 different meal plans available to guests. The most common meal plan is "Meal Plan 1," which was selected 27,835 times out of 36,275 bookings. This suggests that "Meal Plan 1" is the most popular choice among guests.

Room Type Reserved:

There are 7 different types of rooms available. The most commonly reserved room type is "Room_Type 1," with 28,130 reservations. This indicates that "Room_Type 1" is the preferred choice for most guests.

Market Segment Type:

There are 5 different market segments from which bookings originate. The "Online" market segment is the largest, accounting for 23,214 out of 36,275 bookings. This suggests that a significant portion of the hotel's bookings come from online sources.

Booking Status:

There are 2 distinct booking statuses: "Not_Canceled" and "Canceled." The majority of bookings, 24,390 out of 36,275, have a status of "Not_Canceled." This indicates that most bookings are successfully completed and not canceled.

General observations about the hotel bookings:¶

Total Bookings:

The dataset contains 36,275 unique bookings, indicating a substantial volume of reservations managed by the hotel. Guest Preferences:

Meal Plans:

"Meal Plan 1" is the most preferred, chosen by 76.8% (27,835 out of 36,275) of the guests. This suggests that this meal plan is likely well-suited to the guests' needs or offers good value.

Room Types: "Room_Type 1" is the most popular room type, reserved 77.5% (28,130 out of 36,275) of the time. This indicates that this room type meets the needs and preferences of a majority of the guests.

Booking Sources:

Market Segments: The majority of bookings (64%) come from the "Online" market segment (23,214 out of 36,275). This suggests that the hotel's online presence and booking system are effective and widely used by guests.

Booking Outcomes:

Booking Status: Most bookings (67%) are "Not_Canceled" (24,390 out of 36,275), indicating a relatively high rate of completed stays. This reflects positively on the hotel's ability to retain bookings and minimize cancellations.

Booking Patterns:

Guests Composition: The average booking consists of approximately 2 adults and rarely includes children, indicating a clientele primarily composed of couples or solo travelers.

Duration of Stay: The typical booking includes around 1 weekend night and 2 weeknights, pointing to a trend of short stays, possibly for short vacations or business trips.

Lead Time: The average lead time of 85 days suggests that many guests plan their stays well in advance, though there is also a significant portion of last-minute bookings.

Special Requests and Parking:

Special Requests: On average, guests make about 0.62 special requests per booking, indicating that while many guests have no special requests, a notable portion do have specific needs. Car Parking: Only 3% of guests require a car parking space, which could imply that most guests rely on public transportation or other means rather than driving their own vehicles.

Check for duplicates¶

In [ ]:
# Check for duplicates
duplicate_rows = df[df.duplicated()]

# Print the duplicate rows
print("Duplicate Rows:")
print(duplicate_rows)

# Optionally, you can also count the number of duplicates
num_duplicates = df.duplicated().sum()
print("Number of duplicates:", num_duplicates)
Duplicate Rows:
Empty DataFrame
Columns: [Booking_ID, no_of_adults, no_of_children, no_of_weekend_nights, no_of_week_nights, type_of_meal_plan, required_car_parking_space, room_type_reserved, lead_time, arrival_year, arrival_month, arrival_date, market_segment_type, repeated_guest, no_of_previous_cancellations, no_of_previous_bookings_not_canceled, avg_price_per_room, no_of_special_requests, booking_status]
Index: []
Number of duplicates: 0

Observation:

The number of duplicates is 0.

Exploratory Data Analysis (EDA)¶

EDA is an important part of any project involving data. It is important to investigate and understand the data better before building a model with it. A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data. A thorough analysis of the data, in addition to the questions mentioned below, should be done.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.DataFrame(df)

# Display basic statistics for numeric variables
numeric_summary = df.describe()
print(numeric_summary)

# Display basic statistics for categorical variables
categorical_summary = df.describe(include=['object'])
print(categorical_summary)

# Define numeric columns to plot
numeric_cols = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'lead_time',
                'arrival_year', 'arrival_month', 'arrival_date', 'no_of_previous_cancellations',
                'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests']

# Histograms for numeric columns
plt.figure(figsize=(20, 15))
for i, col in enumerate(numeric_cols, 1):
    plt.subplot(4, 3, i)
    sns.histplot(df[col], kde=True)
    plt.title(f'Histogram of {col}')
    plt.tight_layout()

plt.show()

# Box plots for numeric columns
plt.figure(figsize=(20, 15))
for i, col in enumerate(numeric_cols, 1):
    plt.subplot(4, 3, i)
    sns.boxplot(y=df[col])
    plt.title(f'Box plot of {col}')
    plt.tight_layout()

plt.show()

# Define categorical columns to plot
categorical_cols = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type', 'booking_status']

# Bar plots for categorical variables
plt.figure(figsize=(15, 10))
for i, col in enumerate(categorical_cols, 1):
    plt.subplot(2, 2, i)
    df[col].value_counts().plot(kind='bar')
    plt.title(f'Bar plot of {col}')
    plt.tight_layout()

plt.show()
       no_of_adults  no_of_children  no_of_weekend_nights  no_of_week_nights  \
count  36275.000000    36275.000000          36275.000000       36275.000000   
mean       1.844962        0.105279              0.810724           2.204300   
std        0.518715        0.402648              0.870644           1.410905   
min        0.000000        0.000000              0.000000           0.000000   
25%        2.000000        0.000000              0.000000           1.000000   
50%        2.000000        0.000000              1.000000           2.000000   
75%        2.000000        0.000000              2.000000           3.000000   
max        4.000000       10.000000              7.000000          17.000000   

       required_car_parking_space     lead_time  arrival_year  arrival_month  \
count                36275.000000  36275.000000  36275.000000   36275.000000   
mean                     0.030986     85.232557   2017.820427       7.423653   
std                      0.173281     85.930817      0.383836       3.069894   
min                      0.000000      0.000000   2017.000000       1.000000   
25%                      0.000000     17.000000   2018.000000       5.000000   
50%                      0.000000     57.000000   2018.000000       8.000000   
75%                      0.000000    126.000000   2018.000000      10.000000   
max                      1.000000    443.000000   2018.000000      12.000000   

       arrival_date  repeated_guest  no_of_previous_cancellations  \
count  36275.000000    36275.000000                  36275.000000   
mean      15.596995        0.025637                      0.023349   
std        8.740447        0.158053                      0.368331   
min        1.000000        0.000000                      0.000000   
25%        8.000000        0.000000                      0.000000   
50%       16.000000        0.000000                      0.000000   
75%       23.000000        0.000000                      0.000000   
max       31.000000        1.000000                     13.000000   

       no_of_previous_bookings_not_canceled  avg_price_per_room  \
count                          36275.000000        36275.000000   
mean                               0.153411          103.423539   
std                                1.754171           35.089424   
min                                0.000000            0.000000   
25%                                0.000000           80.300000   
50%                                0.000000           99.450000   
75%                                0.000000          120.000000   
max                               58.000000          540.000000   

       no_of_special_requests  
count            36275.000000  
mean                 0.619655  
std                  0.786236  
min                  0.000000  
25%                  0.000000  
50%                  0.000000  
75%                  1.000000  
max                  5.000000  
       Booking_ID type_of_meal_plan room_type_reserved market_segment_type  \
count       36275             36275              36275               36275   
unique      36275                 4                  7                   5   
top      INN00001       Meal Plan 1        Room_Type 1              Online   
freq            1             27835              28130               23214   

       booking_status  
count           36275  
unique              2  
top      Not_Canceled  
freq            24390  

General Observations from Histograms, Box Plots, and Bar Plots¶

Numeric Variables

Histograms:

Number of Adults (no_of_adults):

Most bookings are for two adults, with a noticeable peak at 2. Few bookings include 3 or 4 adults.

Number of Children (no_of_children):

The majority of bookings have no children. There's a small number of bookings with 1 or 2 children, and very few with more than that.

Number of Weekend Nights (no_of_weekend_nights):

Many bookings include 1 or 2 weekend nights. There's a smaller number of bookings with no weekend nights, indicating stays that span weekdays only.

Number of Week Nights (no_of_week_nights):

The distribution shows that many bookings are for 2 or 3 weeknights. Fewer bookings extend beyond 3 weeknights.

Lead Time (lead_time):

The lead time distribution is right-skewed, indicating that most bookings are made well in advance, with a long tail of last-minute bookings.

Arrival Year (arrival_year):

The data is concentrated around the year 2018, with no significant outliers.

Arrival Month (arrival_month):

Bookings are fairly evenly distributed across the months, with slight peaks possibly around popular travel seasons.

Arrival Date (arrival_date):

The distribution is uniform, reflecting that bookings occur consistently throughout the month.

Number of Previous Cancellations (no_of_previous_cancellations):

Most guests have no previous cancellations, with a few guests having 1 or more. Number of Previous Bookings Not Canceled (no_of_previous_bookings_not_canceled):

Most guests have no previous bookings that were not canceled. There are a few guests with multiple successful bookings.

Average Price Per Room (avg_price_per_room):

The distribution is right-skewed, indicating that while many rooms are priced around the mean, there are some higher-priced rooms.

Number of Special Requests (no_of_special_requests):

Most bookings have no special requests, with a smaller number making 1 or more requests.

Box Plots:

Number of Adults:

The box plot confirms that the median number of adults per booking is 2, with few outliers.

Number of Children:

Most bookings have no children, as indicated by the median and a low number of outliers.

Number of Weekend Nights:

The majority of bookings span 1 or 2 weekend nights, with few outliers.

Number of Week Nights:

The median number of weeknights is around 2, with a reasonable spread up to about 5 weeknights and a few outliers.

Lead Time:

The median lead time is significantly less than the mean, indicating some very high lead times (outliers).

Arrival Year:

No significant outliers; the data is mostly for the year 2018.

Arrival Month:

Even distribution with no significant outliers.

Arrival Date:

Consistent distribution across all dates with no significant outliers.

Number of Previous Cancellations:

Very few previous cancellations per booking, with some outliers indicating guests with multiple cancellations.

Number of Previous Bookings Not Canceled:

Most values are clustered around 0, with some significant outliers indicating high numbers of previous successful bookings.

Average Price Per Room:

A wide range of room prices with some outliers at higher price points. Number of Special Requests:

Most bookings have no special requests, but there are some bookings with multiple requests.

Categorical Variables

Bar Plots:

Type of Meal Plan (type_of_meal_plan):

"Meal Plan 1" is the most popular, followed by "Not Selected". The other meal plans are less frequently chosen.

Room Type Reserved (room_type_reserved):

"Room_Type 1" is the most reserved, followed by other room types in descending order of popularity.

Market Segment Type (market_segment_type):

The "Online" segment is the most significant, indicating that most bookings come from online channels. Other segments like "Offline" and "Corporate" are less common.

Booking Status (booking_status):

The majority of bookings are "Not Canceled," indicating a high completion rate. A smaller proportion of bookings are canceled.

Conclusion

The EDA using histograms, box plots, and bar plots provides a comprehensive view of the hotel bookings dataset:

Booking Demographics: Most bookings are for two adults without children, indicating a preference for couples or solo travelers.

Booking Patterns: Stays generally span 1-2 weekend nights and 2-3 weeknights, with bookings often made well in advance.

Special Requests: Most guests do not have special requests, but there is a significant minority who do.

Pricing: Room prices vary widely, with a significant number of higher-priced bookings.

Market Segments: The majority of bookings come from the online market segment, and "Meal Plan 1" is the most popular choice.

Bivariate Analysis¶

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.DataFrame(df)

# Define numeric and categorical columns
numeric_cols = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'lead_time',
                'arrival_year', 'arrival_month', 'arrival_date', 'no_of_previous_cancellations',
                'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests']
categorical_cols = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type', 'booking_status']

# Plotting correlation matrix for numeric variables
plt.figure(figsize=(12, 8))
correlation_matrix = df[numeric_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix of Numeric Variables')
plt.show()

# Pairplots for numeric variables
sns.pairplot(df[numeric_cols])
plt.suptitle('Pairplots of Numeric Variables', y=1.02)
plt.show()

# Box plots of numeric variables against categorical variables
for cat_col in categorical_cols:
    for num_col in numeric_cols:
        plt.figure(figsize=(10, 6))
        sns.boxplot(x=df[cat_col], y=df[num_col])
        plt.title(f'Box plot of {num_col} by {cat_col}')
        plt.xticks(rotation=90)
        plt.show()

# Bar plots of categorical variables against each other
for i, cat_col1 in enumerate(categorical_cols):
    for j, cat_col2 in enumerate(categorical_cols):
        if i < j:
            plt.figure(figsize=(10, 6))
            sns.countplot(x=df[cat_col1], hue=df[cat_col2])
            plt.title(f'Count plot of {cat_col1} by {cat_col2}')
            plt.xticks(rotation=90)
            plt.show()

General Inference¶

Booking Patterns:

Guests tend to book rooms well in advance for longer stays, and these bookings often involve higher room prices. This could suggest a trend where planned vacations or business trips are booked early to secure availability and better prices.

Room Preferences:

The majority of guests prefer certain room types (e.g., Room_Type 1), and these room types often command higher prices. Understanding these preferences can help in room allocation and dynamic pricing strategies. Meal Plans and Pricing:

The popularity of Meal Plan 1 and its potential association with higher room prices suggests that bundling meal plans with room bookings might be a successful strategy for increasing revenue. Market Segments:

The dominance of the "Online" market segment indicates the importance of online marketing and booking platforms. Tailoring marketing efforts to attract more bookings from other segments could help diversify the guest profile. Cancellation Trends:

Bookings with longer lead times and special requests might have different cancellation rates. Addressing issues related to special requests and ensuring guest satisfaction could help reduce cancellations. Special Requests:

Higher-priced rooms tend to have more special requests, indicating that guests paying premium prices might expect more personalized services. Enhancing service quality for these bookings could improve guest satisfaction and loyalty.

What are the busiest months in the hotel?¶

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df=pd.DataFrame(df)

# Calculate the number of bookings for each month
monthly_bookings = df['arrival_month'].value_counts().sort_index()

# Prepare the data for plotting
monthly_bookings_df = monthly_bookings.reset_index()
monthly_bookings_df.columns = ['Month', 'Number of Bookings']

# Plot the number of bookings for each month
plt.figure(figsize=(10, 6))
sns.barplot(data=monthly_bookings_df, x='Month', y='Number of Bookings', palette='viridis', hue='Month', dodge=False)
plt.title('Number of Bookings per Month')
plt.xlabel('Month')
plt.ylabel('Number of Bookings')
plt.xticks(monthly_bookings_df['Month'] - 1, ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.legend([],[], frameon=False)
plt.show()

# Print the monthly bookings
print(monthly_bookings)
arrival_month
1     1014
2     1704
3     2358
4     2736
5     2598
6     3203
7     2920
8     3813
9     4611
10    5317
11    2980
12    3021
Name: count, dtype: int64

Observations:¶

Busiest Month:

October is the busiest month with the highest number of bookings, exceeding 5000. High Booking Months:

Other months with high bookings include September and August, both having more than 4000 bookings. June and July also show a significant number of bookings, indicating a busy summer season. Moderate Booking Months:

April and May have moderate booking numbers, with April showing a slight increase compared to the beginning of the year. November and December have relatively moderate booking levels, likely due to holiday travel. Least Busy Months:

January has the lowest number of bookings, followed by February and March, indicating a quieter start to the year.

Which market segment do most of the guests come from?¶

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df=pd.DataFrame(df)


# Calculate the number of bookings for each market segment
market_segment_bookings = df['market_segment_type'].value_counts()

# Prepare the data for plotting
market_segment_df = market_segment_bookings.reset_index()
market_segment_df.columns = ['Market Segment', 'Number of Bookings']

# Plot the number of bookings for each market segment
plt.figure(figsize=(10, 6))
sns.barplot(data=market_segment_df, x='Market Segment', y='Number of Bookings', palette='viridis', hue='Market Segment', dodge=False)
plt.title('Number of Bookings by Market Segment')
plt.xlabel('Market Segment')
plt.ylabel('Number of Bookings')
plt.xticks(rotation=45)
plt.legend([],[], frameon=False)
plt.show()

# Print the market segment bookings
print(market_segment_bookings)
market_segment_type
Online           23214
Offline          10528
Corporate         2017
Complementary      391
Aviation           125
Name: count, dtype: int64

Based on the bar plot showing the number of bookings by market segment, we can make the following observations:

Dominant Market Segment:

The "Online" segment is the largest source of bookings, with over 20,000 bookings. This indicates that a significant majority of guests book their stays through online channels.

Secondary Market Segment:

The "Offline" segment is the second-largest source of bookings, with around 10,000 bookings. This suggests that a substantial number of guests still prefer traditional booking methods like phone calls or walk-ins.

Other Segments:

The "Corporate" segment has a noticeable but much smaller number of bookings compared to "Online" and "Offline". This implies that corporate clients form a smaller part of the hotel's clientele.

The "Complementary" and "Aviation" segments have very few bookings, indicating that these segments contribute minimally to the hotel's overall bookings.

3.Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?¶

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
# Load the dataset
df=pd.DataFrame(df)


# Calculate the average room price for each market segment
average_price_per_segment = df.groupby('market_segment_type')['avg_price_per_room'].mean().reset_index()

# Plot the average room price for each market segment
plt.figure(figsize=(10, 6))
sns.barplot(data=average_price_per_segment, x='market_segment_type', y='avg_price_per_room', palette='viridis', hue='market_segment_type', dodge=False)
plt.title('Average Room Price by Market Segment')
plt.xlabel('Market Segment')
plt.ylabel('Average Room Price')
plt.xticks(rotation=45)
plt.legend([],[], frameon=False)
plt.show()

# Print the average room prices
print(average_price_per_segment)
  market_segment_type  avg_price_per_room
0            Aviation          100.704000
1       Complementary            3.141765
2           Corporate           82.911740
3             Offline           91.632679
4              Online          112.256855

Observations¶

Inference and Differences in Room Prices

Aviation Segment:

The average room price for the "Aviation" segment is the highest, exceeding $100. This suggests that rooms booked through aviation-related channels (e.g., for layover passengers or airline crew) are priced higher, possibly due to the urgent nature of these bookings and the specific requirements of aviation-related stays.

Online Segment:

The "Online" segment also has a high average room price, close to that of the "Aviation" segment. This indicates that online bookings, which are the most common, generally command a higher price. This could be due to dynamic pricing algorithms used by online booking platforms, which adjust prices based on demand.

Corporate Segment:

The average room price for the "Corporate" segment is slightly lower than the "Online" segment but still high, around $90. Corporate clients typically book premium rooms with additional amenities, leading to higher average prices. The hotel might also charge higher rates due to the added value of corporate services and facilities.

Offline Segment:

The "Offline" segment shows a moderate average room price, slightly lower than the "Corporate" segment. This indicates that guests booking through traditional methods (e.g., phone or walk-ins) might be paying less than online or corporate clients. However, these prices are still higher than complementary bookings.

Complementary Segment:

The average room price for the "Complementary" segment is the lowest, close to zero. This is expected as complementary bookings are typically free of charge, offered as part of loyalty programs, promotions, or compensation for service recovery.

What percentage of bookings are canceled?¶

In [ ]:
import pandas as pd

# Load the dataset
df=pd.DataFrame(df)

# Calculate the total number of bookings
total_bookings = df.shape[0]

# Calculate the number of canceled bookings
canceled_bookings = df[df['booking_status'] == 'Canceled'].shape[0]

# Calculate the percentage of canceled bookings
percentage_canceled = (canceled_bookings / total_bookings) * 100

# Print the result
print(f'Percentage of canceled bookings: {percentage_canceled:.2f}%')
Percentage of canceled bookings: 32.76%

Observations:¶

32.76% of bookings have been cancelled.

Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?¶

In [ ]:
import pandas as pd

# Load the dataset
df=pd.DataFrame(df)

# Calculate the total number of repeating guests
total_repeating_guests = df[df['repeated_guest'] == 1].shape[0]

# Calculate the number of canceled bookings among repeating guests
canceled_repeating_guests = df[(df['repeated_guest'] == 1) & (df['booking_status'] == 'Canceled')].shape[0]

# Calculate the percentage of canceled bookings among repeating guests
percentage_canceled_repeating_guests = (canceled_repeating_guests / total_repeating_guests) * 100

# Print the result
print(f'Percentage of canceled bookings among repeating guests: {percentage_canceled_repeating_guests:.2f}%')
Percentage of canceled bookings among repeating guests: 1.72%

Observations:

1.72 % of canceled bookings among repeated guests.

Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?¶

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency

# Load the dataset
df=pd.DataFrame(df)

# Calculate the cancellation rate for each number of special requests
special_requests_cancellation = df.groupby('no_of_special_requests')['booking_status'].value_counts(normalize=True).unstack().fillna(0)
special_requests_cancellation['Cancellation_Rate'] = special_requests_cancellation['Canceled'] * 100

# Prepare the data for plotting
special_requests_cancellation_df = special_requests_cancellation.reset_index()

# Plot the cancellation rate by number of special requests
plt.figure(figsize=(10, 6))
sns.barplot(data=special_requests_cancellation_df, x='no_of_special_requests', y='Cancellation_Rate', hue='no_of_special_requests', dodge=False, palette='viridis')
plt.title('Cancellation Rate by Number of Special Requests')
plt.xlabel('Number of Special Requests')
plt.ylabel('Cancellation Rate (%)')
plt.legend([],[], frameon=False)
plt.show()

# Perform a Chi-Square test for independence
contingency_table = pd.crosstab(df['no_of_special_requests'], df['booking_status'])
chi2, p, dof, ex = chi2_contingency(contingency_table)

# Print the result of the Chi-Square test
print(f'Chi-Square Test: chi2={chi2}, p-value={p}')

# Interpretation based on p-value
if p < 0.05:
    print('The number of special requests has a significant effect on booking cancellation (p < 0.05).')
else:
    print('The number of special requests does not have a significant effect on booking cancellation (p >= 0.05).')
Chi-Square Test: chi2=2421.6187208019905, p-value=0.0
The number of special requests has a significant effect on booking cancellation (p < 0.05).

Observations

Cancellation Rates by Number of Special Requests:

The bar plot shows a clear trend where the cancellation rate decreases as the number of special requests increases.

Guests with zero special requests have the highest cancellation rate, exceeding 40%.

Guests with one special request have a cancellation rate of around 25%.

Guests with two special requests have a cancellation rate of around 15%.

The cancellation rates for guests with three or more special requests are significantly lower and approach zero for guests with four and five special requests.

Statistical Significance:

The Chi-Square test results indicate a chi2 value of 2421.62 and a p-value of 0.0. Since the p-value is less than 0.05, it confirms that the number of special requests has a significant effect on booking cancellations. This means the observed differences in cancellation rates are statistically significant and unlikely to be due to random chance.

Interpretation

Impact of Special Requests on Cancellations:

Guests with no special requests are more likely to cancel their bookings. This could be due to less commitment or fewer specific needs being met by the hotel. Guests with one or more special requests are less likely to cancel, possibly indicating a higher level of engagement and commitment to their bookings due to specific requirements that the hotel needs to fulfill. Service Improvement:

The hotel might consider enhancing its process for managing special requests. Ensuring that guests feel confident their needs will be met can potentially reduce cancellation rates. Clear communication with guests about their special requests and how the hotel plans to accommodate them might further reduce the likelihood of cancellations. Guest Experience:

Improving the overall guest experience, particularly for those with special requests, can lead to higher satisfaction and loyalty. Training staff to handle special requests effectively and ensuring all departments are aware of and prepared to meet these needs can enhance the guest experience.

Targeted Follow-Up:

For guests who do not make any special requests, the hotel could implement follow-up communications to increase engagement and reduce cancellations. This could include reminders about their booking, highlights of hotel amenities, and personalized offers.

Conclusion

The analysis reveals that special requests have a significant impact on booking cancellations. Guests with special requests are less likely to cancel their bookings compared to those without any special requests.

The hotel can leverage this insight to improve its handling of special requests, enhance guest satisfaction, and reduce overall cancellation rates. By focusing on meeting guests' specific needs and ensuring clear communication, the hotel can foster greater commitment and loyalty from its guests.

Data Pre processing¶

Data preprocessing is a crucial step in the data analysis and machine learning pipeline. It involves transforming raw data into a clean and usable format to ensure the quality and reliability of the data. This process helps in improving the performance of machine learning models by addressing issues such as missing values, noise, and inconsistencies.

Key Steps in Data Preprocessing

Data Cleaning:

Handling Missing Values: Identifying and dealing with missing data using techniques such as imputation, deletion, or filling with mean/median/mode values.

Removing Duplicates: Identifying and removing duplicate records to ensure data integrity.

Correcting Errors: Identifying and correcting data entry errors, outliers, and inconsistencies.

Checking for missing values¶

In [ ]:
df.isnull().sum()
Out[ ]:
Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64

Encoding Categorical Columns¶

Converting categorical data into numeric format using techniques like one-hot encoding or label encoding. ¶

In [ ]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# Load the dataset
df=pd.DataFrame(df)

# Create a copy of the dataset
df1 = df.copy()

# Step 3: Encoding Categorical Variables
categorical_cols = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type']
encoder = OneHotEncoder(sparse_output=False, drop='first')  # Use sparse_output instead of sparse
encoded_features = encoder.fit_transform(df1[categorical_cols])
encoded_feature_names = encoder.get_feature_names_out(categorical_cols)

# Convert encoded features to a DataFrame
encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names)

# Concatenate encoded features with the original dataset
data_encoded = pd.concat([df1.drop(columns=categorical_cols), encoded_df], axis=1)

# Step 4: Creating Interaction Features
data_encoded['total_nights'] = data_encoded['no_of_weekend_nights'] + data_encoded['no_of_week_nights']

# Step 5: Scaling and Normalization
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(data_encoded)

# Convert scaled features to a DataFrame
data_scaled = pd.DataFrame(scaled_features, columns=data_encoded.columns)

# Display the processed DataFrame
data_scaled.head()
Out[ ]:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights required_car_parking_space lead_time arrival_year arrival_month arrival_date ... room_type_reserved_1 room_type_reserved_2 room_type_reserved_3 room_type_reserved_4 room_type_reserved_5 room_type_reserved_6 market_segment_type_1 market_segment_type_2 market_segment_type_3 market_segment_type_4
0 0.000000 0.50 0.0 0.142857 0.117647 0.0 0.820513 0.0 0.818182 0.033333 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.000028 0.50 0.0 0.285714 0.176471 0.0 0.018315 1.0 0.909091 0.166667 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
2 0.000055 0.25 0.0 0.285714 0.058824 0.0 0.003663 1.0 0.090909 0.900000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 0.000083 0.50 0.0 0.000000 0.117647 0.0 0.772894 1.0 0.363636 0.633333 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4 0.000110 0.50 0.0 0.142857 0.058824 0.0 0.175824 1.0 0.272727 0.333333 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

5 rows × 30 columns

Feature Engineering ¶

In [ ]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# Load the dataset
df=pd.DataFrame(df)

# Create a copy of the dataset
df1 = df.copy()

# Step 1: Create New Features
# Example: Total nights stayed
df1['total_nights'] = df1['no_of_weekend_nights'] + df1['no_of_week_nights']

# Example: Average price per night
df1['total_nights'] = df1['total_nights'].replace(0, 1)  # Avoid division by zero by replacing 0 nights with 1
df1['avg_price_per_night'] = df1['avg_price_per_room'] / df1['total_nights']

# Drop the original columns used to create new features if necessary
# df1.drop(columns=['no_of_weekend_nights', 'no_of_week_nights'], inplace=True)

# Step 2: Encode Categorical Variables
categorical_cols = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type']
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_features = encoder.fit_transform(df1[categorical_cols])
encoded_feature_names = encoder.get_feature_names_out(categorical_cols)

# Convert encoded features to a DataFrame
encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names)

# Concatenate encoded features with the original dataset
data_encoded = pd.concat([df1.drop(columns=categorical_cols), encoded_df], axis=1)

# Step 3: Scaling and Normalization
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(data_encoded)

# Convert scaled features to a DataFrame
data_scaled = pd.DataFrame(scaled_features, columns=data_encoded.columns)

# Display the processed DataFrame
data_scaled.head()
Out[ ]:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights required_car_parking_space lead_time arrival_year arrival_month arrival_date ... room_type_reserved_1 room_type_reserved_2 room_type_reserved_3 room_type_reserved_4 room_type_reserved_5 room_type_reserved_6 market_segment_type_1 market_segment_type_2 market_segment_type_3 market_segment_type_4
0 0.000000 0.50 0.0 0.142857 0.117647 0.0 0.820513 0.0 0.818182 0.033333 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.000028 0.50 0.0 0.285714 0.176471 0.0 0.018315 1.0 0.909091 0.166667 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
2 0.000055 0.25 0.0 0.285714 0.058824 0.0 0.003663 1.0 0.090909 0.900000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 0.000083 0.50 0.0 0.000000 0.117647 0.0 0.772894 1.0 0.363636 0.633333 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4 0.000110 0.50 0.0 0.142857 0.058824 0.0 0.175824 1.0 0.272727 0.333333 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

5 rows × 31 columns

Outlier Detection¶

What is an Outlier?

An outlier is a data point that is significantly different from the rest of the observations in a dataset. Outliers can be unusually high or low values that do not fit the general pattern of the data. These points can arise due to various reasons, such as variability in the data, measurement errors, data entry errors, or genuine anomalies.

Characteristics of Outliers

Extreme Values: Outliers are values that lie far away from the mean or median of the dataset.

Influence on Statistical Measures: Outliers can significantly affect statistical measures like mean, standard deviation, and correlation.

Visual Identification: Outliers can often be visually identified in graphical representations like scatter plots, box plots, and histograms.

Importance of Handling Outliers Impact on Analysis: Outliers can skew the results of statistical analyses and lead to incorrect conclusions.

Model Performance: In machine learning, outliers can negatively impact model performance by distorting parameter estimates and increasing prediction errors.

Data Quality: Handling outliers improves the overall quality and reliability of the data.

Types of Outliers

Univariate Outliers: Outliers that are extreme values in a single feature.

Multivariate Outliers: Outliers that are unusual combinations of multiple features.

Contextual Outliers: Outliers that are considered anomalous in a specific context or condition.

Methods for Detecting Outliers 1.Visual Methods

Box Plot and Scatter plot

2.Statistical Methods:

Z-score and IQR (Interquartile Range)

3.Model-Based Methods

Isolation Forest and Local Outlier Factor (LOF)

Here i have used box plot to detect the outliers

Box Plot: Displays the distribution of data based on five summary statistics (minimum, first quartile, median, third quartile, maximum). Outliers are typically shown as points outside the whiskers.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# Load the dataset
df=pd.DataFrame(df)

# Create a copy of the dataset
df1 = df.copy()

# Step 1: Outlier Detection using Boxplot
numeric_columns = df1.select_dtypes(include=np.number).columns.tolist()
num_columns = len(numeric_columns)

# Determine the grid size
grid_size = int(np.ceil(np.sqrt(num_columns)))

plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(grid_size, grid_size, i + 1)
    plt.boxplot(df1[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()

# Step 2: Treating Outliers
# Define a function to cap outliers
def cap_outliers(df, column, upper_quantile=0.95):
    upper_limit = df[column].quantile(upper_quantile)
    df[column] = np.where(df[column] > upper_limit, upper_limit, df[column])
    return df

# Apply the function to relevant numeric columns
columns_to_cap = ['lead_time', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests']

for column in columns_to_cap:
    df1 = cap_outliers(df1, column)

# Step 3: Create New Features
# Example: Total nights stayed
df1['total_nights'] = df1['no_of_weekend_nights'] + df1['no_of_week_nights']

# Example: Average price per night
df1['total_nights'] = df1['total_nights'].replace(0, 1)  # Avoid division by zero by replacing 0 nights with 1
df1['avg_price_per_night'] = df1['avg_price_per_room'] / df1['total_nights']

# Step 4: Encode Categorical Variables
categorical_cols = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type']
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_features = encoder.fit_transform(df1[categorical_cols])
encoded_feature_names = encoder.get_feature_names_out(categorical_cols)

# Convert encoded features to a DataFrame
encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names)

# Concatenate encoded features with the original dataset
data_encoded = pd.concat([df1.drop(columns=categorical_cols), encoded_df], axis=1)

# Step 5: Scaling and Normalization
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(data_encoded)

# Convert scaled features to a DataFrame
data_scaled = pd.DataFrame(scaled_features, columns=data_encoded.columns)

# Display the processed DataFrame
data_scaled.head()
Out[ ]:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights required_car_parking_space lead_time arrival_year arrival_month arrival_date ... room_type_reserved_1 room_type_reserved_2 room_type_reserved_3 room_type_reserved_4 room_type_reserved_5 room_type_reserved_6 market_segment_type_1 market_segment_type_2 market_segment_type_3 market_segment_type_4
0 0.000000 0.50 0.0 0.142857 0.117647 0.0 0.820513 0.0 0.818182 0.033333 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.000028 0.50 0.0 0.285714 0.176471 0.0 0.018315 1.0 0.909091 0.166667 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
2 0.000055 0.25 0.0 0.285714 0.058824 0.0 0.003663 1.0 0.090909 0.900000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 0.000083 0.50 0.0 0.000000 0.117647 0.0 0.772894 1.0 0.363636 0.633333 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4 0.000110 0.50 0.0 0.142857 0.058824 0.0 0.175824 1.0 0.272727 0.333333 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

5 rows × 31 columns

Observations from Outlier Detection and Treatment

Based on the updated boxplot analysis, here are detailed observations:

Number of Adults (no_of_adults):

Outliers: There are few outliers where the number of adults is either 0 or greater than 2. Treatment: These outliers could indicate either errors or special cases (e.g., single parent with children). Number of Children (no_of_children):

Outliers: Significant outliers where the number of children reaches up to 10. Treatment: These outliers might indicate large family bookings. Number of Weekend Nights (no_of_weekend_nights):

Outliers: Few outliers with weekend nights reaching up to 6. Treatment: These values are reasonable and likely represent extended weekend stays. Number of Week Nights (no_of_week_nights):

Outliers: Many outliers where the number of week nights is greater than 5, extending up to 17. Treatment: These values could indicate long-term stays. Required Car Parking Space (required_car_parking_space):

Outliers: Few outliers with requests for car parking space. Treatment: The values are binary (0 or 1) and indicate whether a car parking space is needed. Lead Time (lead_time):

Outliers: Significant outliers with lead times extending up to over 400 days. Treatment: Outliers were capped at the 95th percentile to reduce the impact of extreme values. Arrival Year (arrival_year):

Outliers: A few outliers at the lower end (2017). Treatment: The values represent actual years, so no treatment is necessary. Arrival Month (arrival_month):

Outliers: Minimal outliers. Treatment: The values are within a reasonable range (1 to 12). Arrival Date (arrival_date):

Outliers: Minimal outliers. Treatment: The values are within a reasonable range (1 to 31). Repeated Guest (repeated_guest):

Outliers: Few outliers indicating repeated guests. Treatment: The values are binary (0 or 1) and indicate whether the guest is a repeated guest. Number of Previous Cancellations (no_of_previous_cancellations):

Outliers: Few outliers with guests having up to 12 previous cancellations. Treatment: Outliers were capped at the 95th percentile to reduce the impact of extreme values. Number of Previous Bookings Not Canceled (no_of_previous_bookings_not_canceled):

Outliers: Significant outliers with previous bookings not canceled reaching up to 58. Treatment: Outliers were capped at the 95th percentile to reduce the impact of extreme values. Average Price Per Room (avg_price_per_room):

Outliers: Significant outliers with room prices reaching up to 540 euros. Treatment: Outliers were capped at the 95th percentile to reduce the impact of extreme values. Number of Special Requests (no_of_special_requests):

Outliers: Few outliers with special requests reaching up to 5. Treatment: Outliers were capped at the 95th percentile to reduce the impact of extreme values. Total Nights (total_nights):

Outliers: Many outliers with total nights extending up to 25. Treatment: These values indicate extended stays. Booking Status (booking_status):

Outliers: No outliers observed as the values are binary (0 or 1). Interpretation and Action Handling Outliers:

Capping outliers helps reduce skewness in the data and prevents the model from being overly influenced by extreme values. Further Analysis:

For features like no_of_children and lead_time, further investigation may reveal specific customer segments or booking behaviors that lead to these outliers. Modeling Considerations:

Scaling: Standardizing or normalizing features with significant outliers can improve model performance.

Robust Algorithms: Using algorithms that are less sensitive to outliers or applying preprocessing techniques to mitigate their impact can enhance model robustness.

Conclusion¶

By addressing outliers, creating new features, and scaling the data, the dataset is now better prepared for analysis and modeling tasks. This process ensures that the model will be less affected by extreme values and will likely perform better on new data.

Exploratory Data analysis after Manipulation¶

It is a good idea to explore the data once again after manipulating it.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Load the manipulated dataset
df=pd.DataFrame(df)
df1 = df.copy()

# Fill missing values for numerical columns with the mean
numerical_cols = df1.select_dtypes(include=['float64', 'int64']).columns
df1[numerical_cols] = df1[numerical_cols].fillna(df1[numerical_cols].mean())

# Fill missing values for categorical columns with the mode
categorical_cols = df1.select_dtypes(include=['object']).columns
df1[categorical_cols] = df1[categorical_cols].fillna(df1[categorical_cols].mode().iloc[0])

# Create dummy variables for categorical columns
df1 = pd.get_dummies(df1, columns=categorical_cols, drop_first=True)

# Scale the features
scaler = StandardScaler()
df1[numerical_cols] = scaler.fit_transform(df1[numerical_cols])

# Step 1: Summary Statistics
numerical_summary = df1.describe()
print("Numerical Summary:\n", numerical_summary)

# Step 2: Distribution Plots
plt.figure(figsize=(20, 15))
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(5, 4, i)
    sns.histplot(df1[col], kde=True)
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()

# Step 3: Correlation Matrix
plt.figure(figsize=(15, 10))
correlation_matrix = df1[numerical_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

# Step 4: Box Plots
plt.figure(figsize=(20, 15))
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(5, 4, i)
    sns.boxplot(x=df1[col])
    plt.title(f'Box Plot of {col}')
plt.tight_layout()
plt.show()

# Step 5: Count Plots
categorical_cols = df1.select_dtypes(include=['uint8']).columns
plt.figure(figsize=(20, 15))
for i, col in enumerate(categorical_cols, 1):
    plt.subplot(5, 4, i)
    sns.countplot(y=df1[col])
    plt.title(f'Count Plot of {col}')
plt.tight_layout()
plt.show()
Numerical Summary:
        no_of_adults  no_of_children  no_of_weekend_nights  no_of_week_nights  \
count  3.627500e+04    3.627500e+04          3.627500e+04       3.627500e+04   
mean   4.270112e-17    1.518044e-17          9.950536e-17      -1.165466e-16   
std    1.000014e+00    1.000014e+00          1.000014e+00       1.000014e+00   
min   -3.556844e+00   -2.614704e-01         -9.311902e-01      -1.562353e+00   
25%    2.988926e-01   -2.614704e-01         -9.311902e-01      -8.535778e-01   
50%    2.988926e-01   -2.614704e-01          2.174012e-01      -1.448030e-01   
75%    2.988926e-01   -2.614704e-01          1.365993e+00       5.639718e-01   
max    4.154629e+00    2.457446e+01          7.108950e+00       1.048682e+01   

       required_car_parking_space     lead_time  arrival_year  arrival_month  \
count                3.627500e+04  3.627500e+04  3.627500e+04   3.627500e+04   
mean                 3.917534e-17  6.463931e-17 -2.254506e-13   1.436266e-16   
std                  1.000014e+00  1.000014e+00  1.000014e+00   1.000014e+00   
min                 -1.788193e-01 -9.918878e-01 -2.137469e+00  -2.092496e+00   
25%                 -1.788193e-01 -7.940515e-01  4.678430e-01  -7.895014e-01   
50%                 -1.788193e-01 -3.285544e-01  4.678430e-01   1.877443e-01   
75%                 -1.788193e-01  4.744282e-01  4.678430e-01   8.392415e-01   
max                  5.592239e+00  4.163493e+00  4.678430e-01   1.490739e+00   

       arrival_date  repeated_guest  no_of_previous_cancellations  \
count  3.627500e+04    3.627500e+04                  3.627500e+04   
mean  -4.701041e-17    1.704127e-17                  1.008765e-17   
std    1.000014e+00    1.000014e+00                  1.000014e+00   
min   -1.670074e+00   -1.622099e-01                 -6.339327e-02   
25%   -8.691889e-01   -1.622099e-01                 -6.339327e-02   
50%    4.610867e-02   -1.622099e-01                 -6.339327e-02   
75%    8.469940e-01   -1.622099e-01                 -6.339327e-02   
max    1.762292e+00    6.164850e+00                  3.523139e+01   

       no_of_previous_bookings_not_canceled  avg_price_per_room  \
count                          3.627500e+04        3.627500e+04   
mean                          -3.094852e-17       -7.051561e-17   
std                            1.000014e+00        1.000014e+00   
min                           -8.745646e-02       -2.947468e+00   
25%                           -8.745646e-02       -6.589979e-01   
50%                           -8.745646e-02       -1.132419e-01   
75%                           -8.745646e-02        4.724127e-01   
max                            3.297706e+01        1.244200e+01   

       no_of_special_requests  
count            3.627500e+04  
mean             1.664952e-17  
std              1.000014e+00  
min             -7.881400e-01  
25%             -7.881400e-01  
50%             -7.881400e-01  
75%              4.837605e-01  
max              5.571362e+00  
<Figure size 2000x1500 with 0 Axes>

General Observations from EDA¶

Based on the exploratory data analysis (EDA) performed on the hotel booking dataset, here are the general observations:

Data Distribution:¶

Most features have a right-skewed distribution, particularly lead_time, avg_price_per_room, and no_of_previous_bookings_not_canceled. Uniform distributions are observed in features like arrival_month and arrival_date, indicating bookings are relatively evenly spread across months and days of the month.

Outliers:¶

Significant outliers are present in several features, including no_of_children, lead_time, no_of_previous_cancellations, no_of_previous_bookings_not_canceled, and avg_price_per_room.

Outliers indicate some exceptional cases, such as bookings with a very high number of children or extremely long lead times, which may need to be addressed during data preprocessing.

Booking Patterns:¶

The majority of bookings are for 1 or 2 adults, with most having no children. Most bookings are for 0 to 2 weekend nights and 1 to 3 week nights, indicating typical short stays.

Special Requests and Parking:¶

Most bookings have 0 to 1 special requests, suggesting that guests typically do not have many additional requirements.

The majority of guests do not require a car parking space, which may indicate a higher proportion of local or public transport users.

Customer Behavior:¶

Repeated guests are a small proportion of the overall bookings but tend to have more previous bookings not canceled and fewer cancellations. Guests with higher room prices tend to make more special requests, indicating a correlation between room price and guest expectations.

Lead Time:¶

The lead time for bookings is generally short, with most bookings made within 0 to 100 days before arrival. However, there are significant outliers with lead times extending up to 400 days.

Correlation Insights:¶

A strong positive correlation is observed between repeated_guest and no_of_previous_bookings_not_canceled, indicating that loyal customers are likely to book more frequently without canceling. Moderate correlations exist between avg_price_per_room and the number of adults and children, suggesting that larger groups tend to book more expensive rooms. Implications for Modeling Handling Outliers: Outliers should be carefully examined and treated if necessary to improve model performance and prevent skewed results. Feature Engineering: Creating additional features such as total nights stayed or interaction terms can help capture more information and improve model predictions. Scaling and Encoding: Proper scaling of numeric features and encoding of categorical features is essential for many machine learning algorithms. Customer Segmentation: Insights from EDA can be used for customer segmentation, allowing for targeted marketing and personalized offers.

TEST FOR MULTICOLLINEARITY¶

To check for multicollinearity in the data, we can use several methods, including:

Correlation Matrix: Examine the correlation coefficients between numeric variables.

Variance Inflation Factor (VIF): Calculate the VIF for each feature to quantify how much the variance is inflated due to multicollinearity.

In [ ]:
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Assuming df1 is your preprocessed DataFrame with dummy variables
# Load the manipulated dataset
df=pd.DataFrame(df)
df1 = df.copy()


# Select numeric features for VIF calculation
numeric_cols = df1.select_dtypes(include=['float64', 'int64']).columns

# Calculate VIF for each numeric feature
vif_data = pd.DataFrame()
vif_data['Feature'] = numeric_cols
vif_data['VIF'] = [variance_inflation_factor(df1[numeric_cols].values, i) for i in range(len(numeric_cols))]

print("Variance Inflation Factors (VIF):\n", vif_data)
Variance Inflation Factors (VIF):
                                  Feature        VIF
0                           no_of_adults  16.448306
1                         no_of_children   1.242407
2                   no_of_weekend_nights   1.959993
3                      no_of_week_nights   3.678980
4             required_car_parking_space   1.062769
5                              lead_time   2.174975
6                           arrival_year  29.448847
7                          arrival_month   7.158118
8                           arrival_date   4.204407
9                         repeated_guest   1.595823
10          no_of_previous_cancellations   1.337686
11  no_of_previous_bookings_not_canceled   1.603672
12                    avg_price_per_room  12.751692
13                no_of_special_requests   1.797588

Observations

There are some columns with very high VIF values, indicating presence of strong multicollinearity

High Multicollinearity:¶

No of Adults (no_of_adults): VIF = 16.45

Arrival Year (arrival_year): VIF = 29.45

Avg Price Per Room (avg_price_per_room): VIF = 12.75

Arrival Month (arrival_month): VIF = 7.16

These features have high VIF values indicating significant multicollinearity.

Moderate Multicollinearity:¶

Arrival Date (arrival_date): VIF = 4.20

No of Week Nights (no_of_week_nights): VIF = 3.68

These features have moderate VIF values.

Low Multicollinearity:¶

No of Children (no_of_children): VIF = 1.24

No of Weekend Nights (no_of_weekend_nights): VIF = 1.96

Required Car Parking Space (required_car_parking_space): VIF = 1.06

Lead Time (lead_time): VIF = 2.17

Repeated Guest (repeated_guest): VIF = 1.60

No of Previous Cancellations (no_of_previous_cancellations): VIF = 1.34

No of Previous Bookings Not Canceled (no_of_previous_bookings_not_canceled): VIF = 1.60

No of Special Requests (no_of_special_requests): VIF = 1.80

These features have low VIF values, indicating low multicollinearity.

We will systematically drop numerical columns with VIF > 5

We will ignore the VIF values for dummy variables and the constant (intercept)

Removing Multicollinearity¶

To remove multicollinearity

Drop every column one by one that has a VIF score greater than 5. Looking at the adjusted R-squared and RMSE of all these models. Droping the variable that makes the least change in adjusted R-squared.

Checking the VIF scores again and Continuing till we get all VIF scores under 5. Let's define a function that will help us do this.

In [ ]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Function to calculate VIF
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns
    vif["VIF"] = [round(variance_inflation_factor(predictors.values, i), 2) for i in range(predictors.shape[1])]
    return vif

# Function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))

# Function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100

# Function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    pred = model.predict(predictors)
    r2 = r2_score(target, pred)
    adjr2 = adj_r2_score(predictors, target, pred)
    rmse = np.sqrt(mean_squared_error(target, pred))
    mae = mean_absolute_error(target, pred)
    mape = mape_score(target, pred)
    df_perf = pd.DataFrame(
        {
            "RMSE": [rmse],
            "MAE": [mae],
            "R-squared": [r2],
            "Adj. R-squared": [adjr2],
            "MAPE": [mape],
        }
    )
    return df_perf

# Function to iteratively remove multicollinearity
def remove_multicollinearity(data, target, features, threshold=5):
    X = data[features].fillna(data[features].mean())
    X = sm.add_constant(X)
    y = data[target]

    dropped_features = []

    while True:
        vif_df = checking_vif(X)
        print("VIF values:\n", vif_df)

        max_vif = vif_df['VIF'].max()
        if max_vif <= threshold:
            break

        feature_to_drop = vif_df.sort_values('VIF', ascending=False).iloc[0]['feature']
        if feature_to_drop == 'const':
            print("High VIF due to intercept, stopping removal.")
            break

        print(f"Dropping feature with highest VIF: {feature_to_drop}")
        X = X.drop(columns=[feature_to_drop])
        dropped_features.append(feature_to_drop)

    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train final OLS model
    final_model = sm.OLS(y_train, X_train).fit()

    # Evaluate model performance on training set
    train_perf = model_performance_regression(final_model, X_train, y_train)
    print("Training Performance\n", train_perf)

    # Evaluate model performance on test set
    test_perf = model_performance_regression(final_model, X_test, y_test)
    print("Test Performance\n", test_perf)

    print("Final VIF values:\n", checking_vif(X))
    return X, dropped_features, final_model

# Load and prepare the dataset
df=pd.DataFrame(df)
df1 = df.copy()

# Define target and features
target = 'avg_price_per_room'  # Assuming we want to predict the average price per room
features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space',
            'lead_time', 'arrival_year', 'arrival_month', 'arrival_date', 'repeated_guest',
            'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'no_of_special_requests']

# Remove multicollinearity
final_X, dropped_features, final_model = remove_multicollinearity(df1, target, features)

print("Dropped features to remove multicollinearity:", dropped_features)
print("Final model summary:\n", final_model.summary())
VIF values:
                                  feature          VIF
0                                  const  33396745.53
1                           no_of_adults         1.11
2                         no_of_children         1.03
3                   no_of_weekend_nights         1.05
4                      no_of_week_nights         1.07
5             required_car_parking_space         1.03
6                              lead_time         1.14
7                           arrival_year         1.21
8                          arrival_month         1.22
9                           arrival_date         1.00
10                        repeated_guest         1.54
11          no_of_previous_cancellations         1.33
12  no_of_previous_bookings_not_canceled         1.59
13                no_of_special_requests         1.11
High VIF due to intercept, stopping removal.
Training Performance
         RMSE       MAE  R-squared  Adj. R-squared  MAPE
0  29.860042  22.50379   0.277818        0.277469   inf
Test Performance
         RMSE        MAE  R-squared  Adj. R-squared  MAPE
0  29.965731  22.452745   0.262503        0.261077   inf
Final VIF values:
                                  feature          VIF
0                                  const  33396745.53
1                           no_of_adults         1.11
2                         no_of_children         1.03
3                   no_of_weekend_nights         1.05
4                      no_of_week_nights         1.07
5             required_car_parking_space         1.03
6                              lead_time         1.14
7                           arrival_year         1.21
8                          arrival_month         1.22
9                           arrival_date         1.00
10                        repeated_guest         1.54
11          no_of_previous_cancellations         1.33
12  no_of_previous_bookings_not_canceled         1.59
13                no_of_special_requests         1.11
Dropped features to remove multicollinearity: []
Final model summary:
                             OLS Regression Results                            
==============================================================================
Dep. Variable:     avg_price_per_room   R-squared:                       0.278
Model:                            OLS   Adj. R-squared:                  0.277
Method:                 Least Squares   F-statistic:                     858.3
Date:                Thu, 11 Jul 2024   Prob (F-statistic):               0.00
Time:                        22:31:18   Log-Likelihood:            -1.3974e+05
No. Observations:               29020   AIC:                         2.795e+05
Df Residuals:                   29006   BIC:                         2.796e+05
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
========================================================================================================
                                           coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
const                                -3.714e+04   1013.185    -36.660      0.000   -3.91e+04   -3.52e+04
no_of_adults                            18.2885      0.356     51.392      0.000      17.591      18.986
no_of_children                          27.6942      0.438     63.176      0.000      26.835      28.553
no_of_weekend_nights                    -2.3475      0.206    -11.372      0.000      -2.752      -1.943
no_of_week_nights                       -0.2406      0.129     -1.863      0.062      -0.494       0.013
required_car_parking_space               8.8787      1.012      8.773      0.000       6.895      10.862
lead_time                               -0.0525      0.002    -24.103      0.000      -0.057      -0.048
arrival_year                            18.4382      0.502     36.724      0.000      17.454      19.422
arrival_month                            1.5344      0.063     24.297      0.000       1.411       1.658
arrival_date                             0.0186      0.020      0.923      0.356      -0.021       0.058
repeated_guest                         -27.3924      1.368    -20.027      0.000     -30.073     -24.712
no_of_previous_cancellations             1.2455      0.541      2.304      0.021       0.186       2.305
no_of_previous_bookings_not_canceled    -0.6694      0.125     -5.371      0.000      -0.914      -0.425
no_of_special_requests                   2.3632      0.235     10.046      0.000       1.902       2.824
==============================================================================
Omnibus:                     2251.483   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            13283.815
Skew:                           0.026   Prob(JB):                         0.00
Kurtosis:                       6.314   Cond. No.                     1.17e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.17e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

Observations¶

Observations from the Final Model

Key Metrics and Coefficients:

Standard Error (std err):

The standard errors of the coefficients are relatively low, indicating precise estimates.

P-value (P>|t|):

Most predictors have p-values less than 0.05, indicating they are statistically significant, except for arrival_date which has a p-value of 0.356.

Confidence Interval:

The confidence intervals for significant predictors do not include zero, which further supports their significance.

Adjusted R-squared:

The adjusted R-squared value is 0.277, suggesting that the model explains about 27.7% of the variance in the target variable (avg_price_per_room).

Multicollinearity:

The VIF values indicate no severe multicollinearity among the predictors, with all VIF values well below the threshold of 5, except for the intercept (const) which is extremely high due to numerical scaling issues.

The condition number (1.17e+07) is high, indicating potential multicollinearity or numerical problems.

Coefficient Interpretation:

Positive Predictors:

no_of_adults: Each additional adult increases the average price per room by approximately 18.29 euros.

no_of_children: Each additional child increases the average price per room by approximately 27.69 euros.

required_car_parking_space: Requiring a car parking space increases the average price per room by approximately 8.88 euros.

arrival_year: Each subsequent year increases the average price per room by approximately 18.44 euros.

arrival_month: Certain months increase the average price per room by approximately 1.53 euros.

no_of_previous_cancellations: Each previous cancellation increases the average price per room by approximately 1.25 euros.

no_of_special_requests: Each special request increases the average price per room by approximately 2.36 euros.

Negative Predictors:

no_of_weekend_nights: Each additional weekend night decreases the average price per room by approximately 2.35 euros.

lead_time: Each additional day of lead time decreases the average price per room by approximately 0.0525 euros.

repeated_guest: Being a repeated guest decreases the average price per room by approximately 27.39 euros.

no_of_previous_bookings_not_canceled: Each additional previous booking not canceled decreases the average price per room by approximately 0.669 euros.

no_of_week_nights: Slightly negative (not significant at the 5% level with p-value = 0.062).

Model Performance:

Training Performance:

RMSE: 29.86 MAE: 22.50 R-squared: 0.277 Adjusted R-squared: 0.277 MAPE: inf (due to division by zero or small target values) Test Performance:

RMSE: 29.97 MAE: 22.45 R-squared: 0.263 Adjusted R-squared: 0.261 MAPE: inf (same reason as above)

**Building a Logistic Regression model¶

To build a Logistic Regression model using this dataset, the following steps are

Preprocess the Data: Handle missing values, encode categorical variables, and scale numerical features.

Split the Data: Divide the data into training and testing sets.

Train the Model: Fit a Logistic Regression model to the training data.

Evaluate the Model: Assess the model's performance using the testing data.

start with the preprocessing steps.

Step 1: Preprocessing the Data We'll handle missing values, encode categorical variables using one-hot encoding, and scale numerical features.

Step 2: Splitting the Data We'll split the data into training and testing sets.

Step 3: Training the Model We'll fit a Logistic Regression model to the training data.

Step 4: Evaluating the Model We'll assess the model's performance using accuracy, precision, recall, and F1 score.

In [ ]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, roc_auc_score, mean_squared_error, mean_absolute_error, r2_score
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt

# Assuming df is your preprocessed DataFrame

# Ensure target variable is of numeric type
df['booking_status'] = df['booking_status'].astype('category').cat.codes

# Define target and features
target = 'booking_status'
features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights',
            'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month',
            'repeated_guest', 'no_of_previous_cancellations',
            'no_of_previous_bookings_not_canceled', 'no_of_special_requests']

# Splitting the data into features and target
X = df[features]
y = df[target]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Add constant to the model (intercept)
X_train_final = sm.add_constant(X_train)
X_test_final = sm.add_constant(X_test)

# Ensure all data is numeric
X_train_final = X_train_final.apply(pd.to_numeric)
X_test_final = X_test_final.apply(pd.to_numeric)
y_train = y_train.apply(pd.to_numeric)
y_test = y_test.apply(pd.to_numeric)

# Training the Logistic Regression model using statsmodels
logit_model_final = sm.Logit(y_train, X_train_final).fit()
print(logit_model_final.summary())

# Function to compute model performance metrics
def model_performance_classification(model, predictors, target):
    pred_prob = model.predict(predictors)
    pred = (pred_prob > 0.5).astype(int)
    accuracy = accuracy_score(target, pred)
    roc_auc = roc_auc_score(target, pred_prob)
    mse = mean_squared_error(target, pred)
    mae = mean_absolute_error(target, pred)
    r2 = r2_score(target, pred_prob)
    return pd.DataFrame({
        "Accuracy": [accuracy],
        "ROC-AUC": [roc_auc],
        "MSE": [mse],
        "MAE": [mae],
        "R-squared": [r2]
    })
Optimization terminated successfully.
         Current function value: 0.478136
         Iterations 12
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         booking_status   No. Observations:                29020
Model:                          Logit   Df Residuals:                    29007
Method:                           MLE   Df Model:                           12
Date:                Thu, 11 Jul 2024   Pseudo R-squ.:                  0.2440
Time:                        22:45:15   Log-Likelihood:                -13876.
converged:                       True   LL-Null:                       -18355.
Covariance Type:            nonrobust   LLR p-value:                     0.000
========================================================================================================
                                           coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
const                                 2419.1975     98.980     24.441      0.000    2225.201    2613.194
no_of_adults                            -0.4394      0.030    -14.623      0.000      -0.498      -0.381
no_of_children                          -0.5134      0.036    -14.150      0.000      -0.585      -0.442
no_of_weekend_nights                    -0.1255      0.017     -7.366      0.000      -0.159      -0.092
no_of_week_nights                       -0.0346      0.011     -3.285      0.001      -0.055      -0.014
required_car_parking_space               1.1290      0.126      8.954      0.000       0.882       1.376
lead_time                               -0.0110      0.000    -56.330      0.000      -0.011      -0.011
arrival_year                            -1.1978      0.049    -24.421      0.000      -1.294      -1.102
arrival_month                           -0.0008      0.005     -0.148      0.882      -0.011       0.010
repeated_guest                           2.4000      0.411      5.843      0.000       1.595       3.205
no_of_previous_cancellations            -0.2293      0.074     -3.114      0.002      -0.374      -0.085
no_of_previous_bookings_not_canceled     0.1161      0.092      1.255      0.209      -0.065       0.297
no_of_special_requests                   1.0426      0.024     42.866      0.000       0.995       1.090
========================================================================================================

Model performance evaluation¶

In [ ]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer


# Define target and features
target = 'booking_status'  # Assuming 'booking_status' is the target variable for classification
features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights',
            'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month',
            'repeated_guest', 'no_of_previous_cancellations',
            'no_of_previous_bookings_not_canceled', 'no_of_special_requests',
            'avg_price_per_room']

# Split data into features and target
X = df[features]
y = df[target]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data preprocessing pipeline
numeric_features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights',
                    'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month',
                    'repeated_guest', 'no_of_previous_cancellations',
                    'no_of_previous_bookings_not_canceled', 'no_of_special_requests',
                    'avg_price_per_room']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
    ])

# Prepare the pipeline for logistic regression
from sklearn.linear_model import LogisticRegression

logreg_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('classifier', LogisticRegression(random_state=42, max_iter=1000))])
# Fit the model
logreg_pipeline.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['no_of_adults',
                                                   'no_of_children',
                                                   'no_of_weekend_nights',
                                                   'no_of_week_nights',
                                                   'required_car_parking_space',
                                                   'lead_time', 'arrival_year',
                                                   'arrival_month',
                                                   'repeated_guest',
                                                   'no_of_previous_cancellations',
                                                   'no_of_previous_bookings_not_canceled',
                                                   'no_of_special_requests',
                                                   'avg_price_per_room'])])),
                ('classifier',
                 LogisticRegression(max_iter=1000, random_state=42))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['no_of_adults',
                                                   'no_of_children',
                                                   'no_of_weekend_nights',
                                                   'no_of_week_nights',
                                                   'required_car_parking_space',
                                                   'lead_time', 'arrival_year',
                                                   'arrival_month',
                                                   'repeated_guest',
                                                   'no_of_previous_cancellations',
                                                   'no_of_previous_bookings_not_canceled',
                                                   'no_of_special_requests',
                                                   'avg_price_per_room'])])),
                ('classifier',
                 LogisticRegression(max_iter=1000, random_state=42))])
ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('imputer', SimpleImputer()),
                                                 ('scaler', StandardScaler())]),
                                 ['no_of_adults', 'no_of_children',
                                  'no_of_weekend_nights', 'no_of_week_nights',
                                  'required_car_parking_space', 'lead_time',
                                  'arrival_year', 'arrival_month',
                                  'repeated_guest',
                                  'no_of_previous_cancellations',
                                  'no_of_previous_bookings_not_canceled',
                                  'no_of_special_requests',
                                  'avg_price_per_room'])])
['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'repeated_guest', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'no_of_special_requests', 'avg_price_per_room']
SimpleImputer()
StandardScaler()
LogisticRegression(max_iter=1000, random_state=42)
In [ ]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, accuracy_score

# Predict on the test set
y_pred = logreg_pipeline.predict(X_test)
y_pred_prob = logreg_pipeline.predict_proba(X_test)[:, 1]

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, y_pred_prob))

# Plot ROC Curve
import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label='Logistic Regression (AUC = {:.2f})'.format(roc_auc_score(y_test, y_pred_prob)))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
Accuracy: 0.7875947622329428
Classification Report:
               precision    recall  f1-score   support

           0       0.74      0.56      0.64      2416
           1       0.80      0.90      0.85      4839

    accuracy                           0.79      7255
   macro avg       0.77      0.73      0.74      7255
weighted avg       0.78      0.79      0.78      7255

Confusion Matrix:
 [[1361 1055]
 [ 486 4353]]
ROC-AUC Score: 0.8377523217812228

Observations ¶

The logistic regression model demonstrates reasonable performance with an overall accuracy of 78.76% and an AUC of 0.84. The model effectively identifies cancellations, with high recall and precision for the "Canceled" class. Some predictors, such as the number of adults, children, lead time, and special requests, have a significant impact on booking cancellations. However, there is room for improvement, particularly in enhancing the model's ability to correctly identify non-canceled bookings.

Final Model Summary¶

In [ ]:
print("Columns in X_test:")
print(X_test.columns)
Columns in X_test:
Index(['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
       'no_of_week_nights', 'required_car_parking_space', 'lead_time',
       'arrival_year', 'arrival_month', 'repeated_guest',
       'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
       'no_of_special_requests', 'avg_price_per_room'],
      dtype='object')
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score, confusion_matrix
import seaborn as sns

# Print the columns in X_test to identify the correct feature names
print("Columns in X_test:")
print(X_test.columns)

# Assuming you have identified features to drop (replace these with actual feature names if different)
features_to_drop = ['no_of_weekend_nights', 'no_of_week_nights',
                    'required_car_parking_space', 'lead_time',
                    'arrival_year', 'arrival_month', 'repeated_guest',
                    'no_of_previous_cancellations', 'no_of_special_requests']

# Drop the specified features from the test set
X_test1 = X_test.drop(features_to_drop, axis=1)

# Ensure you drop the same features from X_train
X_train1 = X_train.drop(features_to_drop, axis=1)

# Train the Logistic Regression model on the new training set
logit_model_final = sm.Logit(y_train, sm.add_constant(X_train1)).fit()
print(logit_model_final.summary())

# Compute ROC-AUC for the training set
logit_roc_auc_train = roc_auc_score(y_train, logit_model_final.predict(sm.add_constant(X_train1)))
fpr, tpr, thresholds = roc_curve(y_train, logit_model_final.predict(sm.add_constant(X_train1)))

plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic - Training Data")
plt.legend(loc="lower right")
plt.show()

# Predict on the modified test set
pred_test = logit_model_final.predict(sm.add_constant(X_test1)) > 0.5
pred_test = np.round(pred_test)

# Evaluate accuracy
print("Accuracy on training set: ", accuracy_score(y_train, logit_model_final.predict(sm.add_constant(X_train1)) > 0.5))
print("Accuracy on test set: ", accuracy_score(y_test, pred_test))

# Plotting the confusion matrix for the test set
cm_test = confusion_matrix(y_test, pred_test)
plt.figure(figsize=(7, 5))
sns.heatmap(cm_test, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.title("Confusion Matrix - Test Data")
plt.show()
Columns in X_test:
Index(['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
       'no_of_week_nights', 'required_car_parking_space', 'lead_time',
       'arrival_year', 'arrival_month', 'repeated_guest',
       'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
       'no_of_special_requests', 'avg_price_per_room'],
      dtype='object')
Optimization terminated successfully.
         Current function value: 0.615315
         Iterations 11
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         booking_status   No. Observations:                29020
Model:                          Logit   Df Residuals:                    29015
Method:                           MLE   Df Model:                            4
Date:                Thu, 11 Jul 2024   Pseudo R-squ.:                 0.02566
Time:                        23:06:11   Log-Likelihood:                -17856.
converged:                       True   LL-Null:                       -18327.
Covariance Type:            nonrobust   LLR p-value:                2.598e-202
========================================================================================================
                                           coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
const                                    1.8116      0.057     32.001      0.000       1.701       1.923
no_of_adults                            -0.1662      0.026     -6.313      0.000      -0.218      -0.115
no_of_children                           0.0877      0.033      2.619      0.009       0.022       0.153
no_of_previous_bookings_not_canceled     1.1931      0.168      7.110      0.000       0.864       1.522
avg_price_per_room                      -0.0077      0.000    -18.478      0.000      -0.009      -0.007
========================================================================================================
Accuracy on training set:  0.6702274293590628
Accuracy on test set:  0.6638180565127498

Observations

Accuracy on training set: 0.6702274293590628

Accuracy on test set: 0.6638180565127498

Building a Decision Tree model¶

Data Preparation: preparing the data and splitting it into training and testing sets.

Training the Decision Tree Model: A Decision Tree model is trained using the training data.

Model Accuracy: The code prints the accuracy of the model on both the training and testing sets.

Confusion Matrix: A confusion matrix is generated and visualized using a heatmap.

Recall Score: The recall score is calculated and printed for both the training and testing sets.

Feature Importance: The feature importance of the trained Decision Tree model is plotted.

Decision Tree Plot: The decision tree is visualized with arrows added to the splits.

In [ ]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming df is your preprocessed DataFrame and booking_status is already numeric
# Define target and features
target = 'booking_status'
features = ['no_of_adults', 'no_of_children', 'required_car_parking_space', 'lead_time',
            'arrival_month', 'repeated_guest', 'no_of_previous_cancellations',
            'no_of_previous_bookings_not_canceled', 'no_of_special_requests']

# Prepare the data
tree_data = df[features + [target]].astype(float)

# Drop irrelevant features
tree_data = tree_data.drop(['arrival_date', 'arrival_year', 'no_of_week_nights', 'no_of_weekend_nights'], axis=1, errors='ignore')

# Split the data into features and target
X = tree_data.drop(target, axis=1)
y = tree_data[target]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Initialize and fit the Decision Tree model with pre-pruning
dTree = DecisionTreeClassifier(criterion='gini', random_state=1, max_depth=5, min_samples_split=20, min_samples_leaf=5)
dTree.fit(X_train, y_train)

# Print accuracy
print("Accuracy on training set: ", dTree.score(X_train, y_train))
print("Accuracy on test set: ", dTree.score(X_test, y_test))

# Function to make confusion matrix
def make_confusion_matrix(model, y_actual, X_test, labels=[1, 0]):
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"], columns=['Predicted - No', 'Predicted - Yes'])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt='', cmap='Blues')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.title('Confusion Matrix')
    plt.show()

# Function to calculate recall score
def get_recall_score(model, X_train, y_train, X_test, y_test):
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print("Recall on training set: ", metrics.recall_score(y_train, pred_train))
    print("Recall on test set: ", metrics.recall_score(y_test, pred_test))

# Generate confusion matrix for test set
make_confusion_matrix(dTree, y_test, X_test)

# Calculate recall score
get_recall_score(dTree, X_train, y_train, X_test, y_test)

# Plotting Feature Importance
feature_names = list(X_train.columns)
importances = dTree.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(10, 6))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

# Plotting Decision Tree
plt.figure(figsize=(20, 10))
out = plot_tree(
    dTree,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=['Not_Canceled', 'Canceled']
)

# Add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)

plt.show()

# Text report showing the rules of a decision tree
tree_rules = export_text(dTree, feature_names=feature_names, show_weights=True)
print(tree_rules)
Accuracy on training set:  0.7756379962192816
Accuracy on test set:  0.7782780483322613
Recall on training set:  0.9325855892888602
Recall on test set:  0.9330254041570438
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- lead_time <= 14.50
|   |   |   |--- arrival_month <= 8.50
|   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |--- weights: [0.00, 211.00] class: 1.0
|   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |--- weights: [400.00, 1202.00] class: 1.0
|   |   |   |--- arrival_month >  8.50
|   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |--- weights: [87.00, 844.00] class: 1.0
|   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |--- weights: [0.00, 302.00] class: 1.0
|   |   |--- lead_time >  14.50
|   |   |   |--- arrival_month <= 8.50
|   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |--- weights: [16.00, 239.00] class: 1.0
|   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |--- weights: [2207.00, 2251.00] class: 1.0
|   |   |   |--- arrival_month >  8.50
|   |   |   |   |--- lead_time <= 92.50
|   |   |   |   |   |--- weights: [505.00, 1786.00] class: 1.0
|   |   |   |   |--- lead_time >  92.50
|   |   |   |   |   |--- weights: [334.00, 283.00] class: 0.0
|   |--- no_of_special_requests >  0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- lead_time <= 8.50
|   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |--- weights: [29.00, 906.00] class: 1.0
|   |   |   |   |--- lead_time >  4.50
|   |   |   |   |   |--- weights: [44.00, 438.00] class: 1.0
|   |   |   |--- lead_time >  8.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- weights: [981.00, 4084.00] class: 1.0
|   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |--- weights: [1.00, 196.00] class: 1.0
|   |   |--- no_of_special_requests >  1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- weights: [2.00, 596.00] class: 1.0
|   |   |   |   |--- lead_time >  8.50
|   |   |   |   |   |--- weights: [36.00, 1842.00] class: 1.0
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- weights: [107.00, 391.00] class: 1.0
|   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |--- weights: [0.00, 90.00] class: 1.0
|--- lead_time >  151.50
|   |--- no_of_adults <= 1.50
|   |   |--- lead_time <= 189.00
|   |   |   |--- lead_time <= 165.00
|   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |--- weights: [61.00, 11.00] class: 0.0
|   |   |   |   |--- lead_time >  163.50
|   |   |   |   |   |--- weights: [9.00, 64.00] class: 1.0
|   |   |   |--- lead_time >  165.00
|   |   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |   |--- weights: [198.00, 10.00] class: 0.0
|   |   |   |   |--- no_of_special_requests >  0.50
|   |   |   |   |   |--- weights: [18.00, 13.00] class: 0.0
|   |   |--- lead_time >  189.00
|   |   |   |--- lead_time <= 255.50
|   |   |   |   |--- lead_time <= 192.50
|   |   |   |   |   |--- weights: [7.00, 61.00] class: 1.0
|   |   |   |   |--- lead_time >  192.50
|   |   |   |   |   |--- weights: [116.00, 77.00] class: 0.0
|   |   |   |--- lead_time >  255.50
|   |   |   |   |--- lead_time <= 341.00
|   |   |   |   |   |--- weights: [15.00, 154.00] class: 1.0
|   |   |   |   |--- lead_time >  341.00
|   |   |   |   |   |--- weights: [30.00, 19.00] class: 0.0
|   |--- no_of_adults >  1.50
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- lead_time <= 271.00
|   |   |   |   |   |--- weights: [1255.00, 202.00] class: 0.0
|   |   |   |   |--- lead_time >  271.00
|   |   |   |   |   |--- weights: [710.00, 52.00] class: 0.0
|   |   |   |--- no_of_special_requests >  0.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- weights: [1078.00, 480.00] class: 0.0
|   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |--- weights: [0.00, 57.00] class: 1.0
|   |   |--- arrival_month >  11.50
|   |   |   |--- lead_time <= 207.50
|   |   |   |   |--- lead_time <= 165.00
|   |   |   |   |   |--- weights: [2.00, 37.00] class: 1.0
|   |   |   |   |--- lead_time >  165.00
|   |   |   |   |   |--- weights: [16.00, 39.00] class: 1.0
|   |   |   |--- lead_time >  207.50
|   |   |   |   |--- lead_time <= 324.00
|   |   |   |   |   |--- weights: [85.00, 91.00] class: 1.0
|   |   |   |   |--- lead_time >  324.00
|   |   |   |   |   |--- weights: [14.00, 1.00] class: 0.0

Observations

Accuracy on training set: 0.7756379962192816

Accuracy on test set: 0.7782780483322613

Recall on training set: 0.9325855892888602

Recall on test set: 0.9330254041570438

Do we need to prune the tree?¶

Pruning a decision tree can help to prevent overfitting, which occurs when the model captures noise in the training data rather than the underlying patterns. Pruning reduces the complexity of the tree, which can improve its generalization performance on new, unseen data.

Pre-pruning techniques are applied by setting parameters such as max_depth, min_samples_split, and min_samples_leaf.

These parameters help to limit the growth of the tree and avoid overfitting.

Current model performance:

Training and Testing Accuracy:

Training Accuracy: 77.56%

Testing Accuracy: 77.83%

Recall Scores:

Training Recall: 93.26%

Testing Recall: 93.30%

These results indicate that the model is performing well, with similar performance on both the training and testing sets, suggesting that the current level of pruning is effective.

However, if we want to further explore the effect of different pruning parameters, we can experiment with additional values for max_depth, min_samples_split, and min_samples_leaf.

In [ ]:
dTree1 = DecisionTreeClassifier(criterion = 'gini',max_depth=3,random_state=1)
dTree1.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(max_depth=3, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=3, random_state=1)
In [ ]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming df is your preprocessed DataFrame and booking_status is already numeric

# Define target and features
target = 'booking_status'
features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights',
            'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month',
            'repeated_guest', 'no_of_previous_cancellations',
            'no_of_previous_bookings_not_canceled', 'no_of_special_requests']

# Prepare the data
tree_data = df[features + [target]].astype(float)

# Drop irrelevant features
tree_data = tree_data.drop(['arrival_date', 'arrival_year', 'no_of_week_nights', 'no_of_weekend_nights'], axis=1, errors='ignore')

# Split the data into features and target
X = tree_data.drop(target, axis=1)
y = tree_data[target]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Initialize and fit the Decision Tree model with max depth of 3
dTree1 = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=1)
dTree1.fit(X_train, y_train)

# Function to make confusion matrix
def make_confusion_matrix_sklearn(model, X, y):
    y_pred = model.predict(X)
    cm = metrics.confusion_matrix(y, y_pred)
    sns.heatmap(cm, annot=True, fmt='g')
    plt.xlabel("Predicted Values")
    plt.ylabel("Actual Values")
    plt.title("Confusion Matrix")
    plt.show()

# Generate confusion matrix for the test set
make_confusion_matrix_sklearn(dTree1, X_test, y_test)

# Print accuracy
print("Accuracy on training set: ", dTree1.score(X_train, y_train))
print("Accuracy on test set: ", dTree1.score(X_test, y_test))

# Plotting Feature Importances
feature_names = list(X_train.columns)
importances = dTree1.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

# Plotting Decision Tree
plt.figure(figsize=(20, 10))
out = plot_tree(
    dTree1,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=['Not_Canceled', 'Canceled']
)

# Add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)

plt.show()
Accuracy on training set:  0.7667375551354757
Accuracy on test set:  0.7701920426353027
In [ ]:
from sklearn.tree import DecisionTreeClassifier

# Fit the initial decision tree
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)

# Prune the tree using cost complexity pruning
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Plot the total impurity vs effective alpha
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("Effective Alpha")
ax.set_ylabel("Total Impurity of Leaves")
ax.set_title("Total Impurity vs Effective Alpha for Training Set")
plt.show()
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

# Fit the initial decision tree
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)

# Prune the tree using cost complexity pruning
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Filter out non-valid (negative) ccp_alpha values
valid_ccp_alphas = [alpha for alpha in ccp_alphas if alpha >= 0]

# Select a subset of ccp_alphas to speed up the process
subset_ccp_alphas = valid_ccp_alphas[::10]  # Select every 10th value for example

# Decision Tree classifier for every valid alpha
clfs = []
for ccp_alpha in subset_ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)

print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, subset_ccp_alphas[-1]))

# Remove the last element which is the trivial tree with one node
clfs = clfs[:-1]
subset_ccp_alphas = subset_ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]

# Plotting
fig, ax = plt.subplots(2, 1, figsize=(10, 7))

ax[0].plot(subset_ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("Alpha")
ax[0].set_ylabel("Number of Nodes")
ax[0].set_title("Number of Nodes vs Alpha")

ax[1].plot(subset_ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("Alpha")
ax[1].set_ylabel("Depth of Tree")
ax[1].set_title("Depth vs Alpha")

fig.tight_layout()
plt.show()
Number of nodes in the last tree is: 5 with ccp_alpha: 0.010030306239306092
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

# Fit the initial decision tree
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)

# Prune the tree using cost complexity pruning
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Filter out non-valid (negative) ccp_alpha values
valid_ccp_alphas = [alpha for alpha in ccp_alphas if alpha >= 0]

# Select a subset of ccp_alphas to speed up the process
subset_ccp_alphas = valid_ccp_alphas[::10]  # Select every 10th value for example

# Decision Tree classifier for every valid alpha
clfs = []
for ccp_alpha in subset_ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)

print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, subset_ccp_alphas[-1]))

# Remove the last element which is the trivial tree with one node
clfs = clfs[:-1]
subset_ccp_alphas = subset_ccp_alphas[:-1]

train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]

# Plotting accuracy vs alpha
fig, ax = plt.subplots(figsize=(10, 5))
ax.set_xlabel("Alpha")
ax.set_ylabel("Accuracy")
ax.set_title("Accuracy vs Alpha for Training and Testing Sets")
ax.plot(subset_ccp_alphas, train_scores, marker='o', label="Train", drawstyle="steps-post")
ax.plot(subset_ccp_alphas, test_scores, marker='o', label="Test", drawstyle="steps-post")
ax.legend()
plt.show()
Number of nodes in the last tree is: 5 with ccp_alpha: 0.010030306239306092
In [ ]:
# Selecting the best model based on test scores
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print("Best Decision Tree Model:\n", best_model)
print('Training accuracy of best model: ', best_model.score(X_train, y_train))
print('Test accuracy of best model: ', best_model.score(X_test, y_test))

# Recall for training set
recall_train = []
for clf in clfs:
    pred_train3 = clf.predict(X_train)
    values_train = metrics.recall_score(y_train, pred_train3)
    recall_train.append(values_train)

# Recall for testing set
recall_test = []
for clf in clfs:
    pred_test3 = clf.predict(X_test)
    values_test = metrics.recall_score(y_test, pred_test3)
    recall_test.append(values_test)

# Plotting recall vs alpha
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(subset_ccp_alphas, recall_train, marker='o', label="train", drawstyle="steps-post")
ax.plot(subset_ccp_alphas, recall_test, marker='o', label="test", drawstyle="steps-post")
ax.legend()
plt.show()
Best Decision Tree Model:
 DecisionTreeClassifier(ccp_alpha=8.751662815935032e-05, random_state=1)
Training accuracy of best model:  0.8482592942659105
Test accuracy of best model:  0.8278967196545071
In [ ]:
# Assuming 'recall_test' and 'clfs' have been defined in previous cells

# Creating the model where we get the highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print("Best Decision Tree Model:\n", best_model)

# Evaluating the best model
print('Training accuracy of best model: ', best_model.score(X_train, y_train))
print('Test accuracy of best model: ', best_model.score(X_test, y_test))

# Evaluating model performance on the training set
train_performance = model_performance_classification_sklearn(best_model, X_train, y_train)
print("Training Performance\n", train_performance)

# Evaluating model performance on the test set
test_performance = model_performance_classification_sklearn(best_model, X_test, y_test)
print("Test Performance\n", test_performance)

# Plotting feature importance for the best model
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Best Decision Tree Model:
 DecisionTreeClassifier(ccp_alpha=0.001528941615543386, random_state=1)
Training accuracy of best model:  0.7747321991178324
Test accuracy of best model:  0.7764403197647708
Training Performance
    Accuracy    Recall   ROC-AUC
0  0.774732  0.941335  0.818325
Test Performance
    Accuracy    Recall   ROC-AUC
0   0.77644  0.939954  0.815768
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
import seaborn as sns

# Function to make confusion matrix
def make_confusion_matrix_sklearn(model, X, y):
    y_pred = model.predict(X)
    cm = metrics.confusion_matrix(y, y_pred)
    sns.heatmap(cm, annot=True, fmt='g')
    plt.xlabel("Predicted Values")
    plt.ylabel("Actual Values")
    plt.title("Confusion Matrix")
    plt.show()

# Generate confusion matrix for the best model on the test set
make_confusion_matrix_sklearn(best_model, X_test, y_test)
In [ ]:
the_features = X_train.columns

plt.figure(figsize=(17, 15))
plot_tree(
    best_model,
    feature_names=the_features,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True
)
plt.show()

Model Performance Comparison and Conclusions¶

In [ ]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Function to compute performance metrics for a classification model
def compute_performance_metrics(model, X_train, y_train, X_test, y_test):
    metrics_dict = {}

    # Training set predictions
    y_train_pred = model.predict(X_train)
    y_train_prob = model.predict_proba(X_train)[:, 1]

    # Test set predictions
    y_test_pred = model.predict(X_test)
    y_test_prob = model.predict_proba(X_test)[:, 1]

    # Training set performance
    metrics_dict['Train Accuracy'] = accuracy_score(y_train, y_train_pred)
    metrics_dict['Train Precision'] = precision_score(y_train, y_train_pred)
    metrics_dict['Train Recall'] = recall_score(y_train, y_train_pred)
    metrics_dict['Train F1 Score'] = f1_score(y_train, y_train_pred)
    metrics_dict['Train ROC AUC'] = roc_auc_score(y_train, y_train_prob)

    # Test set performance
    metrics_dict['Test Accuracy'] = accuracy_score(y_test, y_test_pred)
    metrics_dict['Test Precision'] = precision_score(y_test, y_test_pred)
    metrics_dict['Test Recall'] = recall_score(y_test, y_test_pred)
    metrics_dict['Test F1 Score'] = f1_score(y_test, y_test_pred)
    metrics_dict['Test ROC AUC'] = roc_auc_score(y_test, y_test_prob)

    return metrics_dict

# Compute performance metrics for the best decision tree model
decision_tree_metrics = compute_performance_metrics(best_model, X_train, y_train, X_test, y_test)

# Print performance metrics
decision_tree_metrics_df = pd.DataFrame.from_dict(decision_tree_metrics, orient='index', columns=['Decision Tree'])
print(decision_tree_metrics_df)

# You can add other models' metrics to this DataFrame for comparison
# For example, adding logistic regression model's metrics
# logistic_regression_metrics = compute_performance_metrics(logistic_regression_model, X_train, y_train, X_test, y_test)
# decision_tree_metrics_df['Logistic Regression'] = pd.Series(logistic_regression_metrics)

# Print comparison table
print(decision_tree_metrics_df)
                 Decision Tree
Train Accuracy        0.774732
Train Precision       0.772493
Train Recall          0.941335
Train F1 Score        0.848597
Train ROC AUC         0.818325
Test Accuracy         0.776440
Test Precision        0.776543
Test Recall           0.939954
Test F1 Score         0.850470
Test ROC AUC          0.815768
                 Decision Tree
Train Accuracy        0.774732
Train Precision       0.772493
Train Recall          0.941335
Train F1 Score        0.848597
Train ROC AUC         0.818325
Test Accuracy         0.776440
Test Precision        0.776543
Test Recall           0.939954
Test F1 Score         0.850470
Test ROC AUC          0.815768

Observations

Accuracy:

The model shows high accuracy on both the training set (0.775) and the test set (0.776). This indicates that the model effectively predicts the correct classes and generalizes well to unseen data. Precision:

Precision is similarly high for both the training set (0.772) and the test set (0.777). This suggests that the model has a low false positive rate, making few incorrect positive predictions. Recall:

The recall values are exceptionally high for both the training set (0.941) and the test set (0.940). This implies that the model is very effective at identifying true positive cases, making it reliable for scenarios where missing positive cases is critical. F1 Score:

The F1 score, which balances precision and recall, is high for both the training (0.849) and test sets (0.850). This indicates a good balance between precision and recall, making the model robust in terms of both metrics. ROC AUC:

The ROC AUC scores are 0.818 for the training set and 0.816 for the test set. These high values indicate a strong ability of the model to distinguish between positive and negative classes, reflecting good overall performance. Generalization:

The performance metrics for the training and test sets are very close, suggesting that the model generalizes well to unseen data and is not overfitting.

Conclusion¶

In [ ]:
# Summarizing  the performance of the best model
best_model_performance = decision_tree_metrics_df['Decision Tree']

print("\nModel Performance Summary:")
print(f"Train Accuracy: {best_model_performance['Train Accuracy']:.2f}")
print(f"Test Accuracy: {best_model_performance['Test Accuracy']:.2f}")
print(f"Train Precision: {best_model_performance['Train Precision']:.2f}")
print(f"Test Precision: {best_model_performance['Test Precision']:.2f}")
print(f"Train Recall: {best_model_performance['Train Recall']:.2f}")
print(f"Test Recall: {best_model_performance['Test Recall']:.2f}")
print(f"Train F1 Score: {best_model_performance['Train F1 Score']:.2f}")
print(f"Test F1 Score: {best_model_performance['Test F1 Score']:.2f}")
print(f"Train ROC AUC: {best_model_performance['Train ROC AUC']:.2f}")
print(f"Test ROC AUC: {best_model_performance['Test ROC AUC']:.2f}")

# Conclusions
print("\nConclusions:")
print("1. The decision tree model shows high accuracy on both the training and test sets, indicating good generalization.")
print("2. The recall scores are particularly high, suggesting that the model is effective in identifying positive cases.")
print("3. The ROC AUC score is also high, indicating a strong ability to distinguish between the classes.")
print("4. Overall, the decision tree model performs well, but it is important to monitor for potential overfitting.")
print("5. Further tuning of hyperparameters and possibly pruning the tree could help in improving the model performance.")
Model Performance Summary:
Train Accuracy: 0.77
Test Accuracy: 0.78
Train Precision: 0.77
Test Precision: 0.78
Train Recall: 0.94
Test Recall: 0.94
Train F1 Score: 0.85
Test F1 Score: 0.85
Train ROC AUC: 0.82
Test ROC AUC: 0.82

Conclusions:
1. The decision tree model shows high accuracy on both the training and test sets, indicating good generalization.
2. The recall scores are particularly high, suggesting that the model is effective in identifying positive cases.
3. The ROC AUC score is also high, indicating a strong ability to distinguish between the classes.
4. Overall, the decision tree model performs well, but it is important to monitor for potential overfitting.
5. Further tuning of hyperparameters and possibly pruning the tree could help in improving the model performance.

Actionable Insights and Recommendations¶

Based on the model performance and data analysis, some insights and recommendation

  1. Profitable Policies for Cancellations and Refunds

Implementing Tiered Refund Policies:

Non-Refundable Rates: Offer a lower rate for bookings that are non-refundable. This will attract price-sensitive customers while ensuring revenue even if cancellations occur.

Partial Refunds: Implement a tiered refund policy where the refund amount decreases as the check-in date approaches.

For example: 90% refund if canceled more than 30 days before check-in.

50% refund if canceled 15-30 days before check-in.

25% refund if canceled 7-14 days before check-in.

No refund if canceled within 7 days of check-in.

  1. Offer Travel Insurance:

Partner with travel insurance companies to offer optional travel insurance to customers. This can cover cancellations due to unforeseen circumstances, reducing the hotel's liability while providing customers with peace of mind.

  1. Flexible Booking Options:

Introduce flexible booking options with a higher rate, allowing customers to cancel or modify their reservations without penalty up to a certain period before check-in. This can cater to customers seeking flexibility and willing to pay a premium for it.

  1. Incentivize Rebooking:

Offer incentives such as discounts or free amenities for customers who choose to rebook instead of canceling their reservation. This can help retain customers and maintain revenue streams.

  1. Enhance Customer Loyalty Programs:

Strengthen loyalty programs by offering exclusive benefits such as room upgrades, complimentary services, and special discounts for repeat customers. This can increase customer retention and encourage repeat bookings.

6.Personalize Customer Experience:

Use customer data to personalize their experience. For example, offer tailored packages based on previous stay preferences, send personalized messages or offers for special occasions, and ensure that special requests are noted and fulfilled.

7.Optimize Room Pricing:

Implement dynamic pricing strategies to adjust room rates based on demand, seasonality, and occupancy levels. Utilize data analytics to forecast demand and optimize pricing to maximize revenue.

8.Improve Online Presence and Booking Experience:

Enhance the hotel’s website and mobile app to provide a seamless and user-friendly booking experience. Ensure that the booking process is simple, fast, and secure. Additionally, optimize the hotel’s presence on online travel agencies (OTAs) and maintain high ratings and positive reviews.

  1. Invest in Staff Training:

Continuously train staff to provide exceptional customer service. Well-trained staff can improve guest satisfaction, handle cancellations and refunds more effectively, and encourage positive reviews and repeat business.

  1. Implementing Sustainable Practices:

Adopt environmentally friendly practices such as reducing energy consumption, minimizing waste, and using eco-friendly products. Promote these initiatives to attract environmentally conscious travelers.

11.Leverage Technology for Efficiency:

Invest in technology solutions like property management systems (PMS), customer relationship management (CRM) systems, and artificial intelligence (AI) for predictive analytics. These tools can streamline operations, enhance customer engagement, and provide valuable insights for decision-making.

12.Offer Unique Packages and Experiences:

Create unique packages that combine accommodation with local experiences, such as guided tours, culinary classes, or wellness retreats. This can differentiate the hotel from competitors and attract guests looking for more than just a place to stay.

13.Monitor and Respond to Feedback:

Actively monitor online reviews and customer feedback to identify areas for improvement. Respond promptly and professionally to both positive and negative reviews to show that the hotel values its guests opinions.

14.Expand Marketing Efforts:

Utilize targeted marketing campaigns to reach potential customers through various channels, including social media, email newsletters, and partnerships with travel influencers. Highlight unique selling points and promotions to attract a wider audience.

By implementing these profitable policies for cancellations and refunds, along with the additional recommendations, the hotel can enhance its revenue, improve customer satisfaction, and maintain a competitive edge in the market.

In [ ]: